Abstract



The recursive least-squares (RLS) algorithm is one of the best-known algorithms used
in adaptive filtering, system identification and adaptive control. Its popularity is mainly due to its
fast convergence speed, which is considered to be optimal in practice. In this paper, RLS methods
are used to solve reinforcement learning problems, and two new reinforcement learning
algorithms using linear value function approximators are proposed and analyzed. The two
algorithms are called RLS-TD(λ) and Fast-AHC (Fast Adaptive Heuristic Critic), respectively.
RLS-TD(λ) can be viewed as the extension of RLS-TD(0) from λ = 0 to general 0 < λ < 1, so it is
a multi-step temporal-difference (TD) learning algorithm using RLS methods. The convergence
with probability one and the limit of convergence of RLS-TD(λ) are proved for ergodic Markov
chains. Compared to the existing LS-TD(λ) algorithm, RLS-TD(λ) has advantages in
computation and is more suitable for online learning. The effectiveness of RLS-TD(λ) is
analyzed and verified by learning prediction experiments on Markov chains with a wide range of
parameter settings.

The Fast-AHC algorithm is derived by applying the proposed RLS-TD(λ) algorithm in the
critic network of the adaptive heuristic critic method. Unlike the conventional AHC algorithm,
Fast-AHC makes use of RLS methods to improve the learning-prediction efficiency in the critic.
Learning control experiments on the cart-pole balancing and the acrobot swing-up problems are
conducted to compare the data efficiency of Fast-AHC with that of conventional AHC. The
experimental results show that the data efficiency of learning control can also be improved
by using RLS methods in the learning-prediction process of the critic. The performance of
Fast-AHC is also compared with that of the AHC method using LS-TD(λ). Furthermore, it is
demonstrated in the experiments that different initial values of the variance matrix in RLS-TD(λ)
are required to get better performance not only in learning prediction but also in learning control.
The experimental results are analyzed based on the existing theoretical work on the transient
phase of forgetting factor RLS methods.



1. Introduction 

In recent years, reinforcement learning (RL) has been an active research area not only in machine
learning but also in control engineering, operations research and robotics (Kaelbling et al., 1996;
Bertsekas et al., 1996; Sutton and Barto, 1998; Lin, 1992). It is a computational approach to
understanding and automating goal-directed learning and decision-making, without relying on
exemplary supervision or complete models of the environment. In RL, an agent is placed in an
initially unknown environment and only receives evaluative feedback from the environment. The
feedback is called the reward or reinforcement signal. The ultimate goal of RL is to learn a strategy
for selecting actions such that the expected sum of discounted rewards is maximized.

Since many real-world problems are sequential decision processes with delayed
evaluative feedback, the research in RL has been focused on theory and algorithms of learning to 
solve the optimal control problem of Markov decision processes (MDPs) which provide an 
elegant mathematical model for sequential decision-making. In operations research, many results 
have been presented to solve the optimal control problem of MDPs with model information. 
However, in reinforcement learning, the model information is assumed to be unknown, which is 
different from the methods studied in operations research such as dynamic programming. In 
dynamic programming, there are two elemental processes, which are the policy evaluation
process and the policy improvement process, respectively. In RL, there are two similar processes.
One is called learning prediction and the other is called learning control. The goal of learning 
control is to estimate the optimal policy or optimal value function of an MDP without knowing its 
model. Learning prediction aims to solve the policy evaluation problem of a stationary-policy 
MDP without any prior model and it can be regarded as a sub-problem of learning control. 
Furthermore, in RL, learning prediction is different from that in supervised learning. As pointed 
out by Sutton (1988), the prediction problems in supervised learning are single-step prediction 
problems while those in reinforcement learning are multi-step prediction problems. To solve 
multi-step prediction problems, a learning system must predict outcomes that depend on a future 
sequence of decisions. Therefore, the theory and algorithms for multi-step learning prediction 
become an important topic in RL and much research work has been done in the literature (Sutton, 
1988; Tsitsiklis and Roy, 1997). 

Among the proposed multi-step learning prediction methods, temporal-difference (TD)
learning (Sutton, 1988) is one of the most popular. It was studied and applied in the early
research of machine learning, including the celebrated checkers-playing program (Minsky, 1954;
Samuel, 1959). In 1988, Sutton presented the first formal description of temporal-difference
methods and the TD(λ) algorithm (Sutton, 1988). Convergence results have been established for tabular
temporal-difference learning algorithms, where the cardinality of tunable parameters is the same
as that of the state space (Sutton, 1988; Watkins et al., 1992; Dayan et al., 1994; Jaakkola et
al., 1994). Since many real-world applications have large or infinite state spaces, value function
approximation (VFA) methods need to be used in those cases. When combined with nonlinear
value function approximators, TD(λ) cannot be guaranteed to converge, and several results
regarding divergence have been reported in the literature (Tsitsiklis and Roy, 1997). For TD(λ)
with linear function approximators, also called linear TD(λ) algorithms, several convergence
proofs have been presented. Dayan (1992) showed the convergence in the mean for linear TD(λ)
algorithms with arbitrary 0 < λ < 1. Tsitsiklis and Roy (1994) proved the convergence for a
special class of TD learning algorithms, known as TD(0), while in Tsitsiklis and Roy (1997), they
extended the early results to the general linear TD(λ) case and proved the convergence with
probability one.

The above linear TD(λ) algorithms have rules for updating parameters similar to those in
gradient-descent methods. However, as in gradient-learning methods, a step-size schedule must
be carefully designed not only to guarantee convergence but also to obtain good performance. In
addition, data are used inefficiently, which slows the convergence of the algorithms. Based on
the theory of linear least-squares estimation, Bradtke and Barto (1996) proposed two
temporal-difference algorithms called the Least-Squares TD(0) algorithm (LS-TD(0)) and the
Recursive Least-Squares TD(0) algorithm (RLS-TD(0)), respectively. LS-TD(0) and RLS-TD(0)
are more efficient in a statistical sense than conventional linear TD(λ) algorithms and they
eliminate the design of step-size schedules. Furthermore, the convergence of LS-TD(0) and
RLS-TD(0) has been proved theoretically. The above two algorithms can be viewed as the
least-squares versions of conventional linear TD(0) methods. However, as has been shown in the
literature, TD learning algorithms such as TD(λ) with 0 < λ < 1 that update predictions based on
the estimates of multiple steps are more efficient than Monte-Carlo methods as well as TD(0). By
employing the mechanism of eligibility traces, which is determined by λ, TD(λ) algorithms
with 0 < λ < 1 can extract more information from historical data. Recently, a class of linear
temporal-difference learning algorithms called LS-TD(λ) has been proposed by Boyan
(1999, 2002), where least-squares methods are employed to compute the value-function estimation
of TD(λ) with 0 < λ < 1. Although LS-TD(λ) is more efficient than TD(λ), it requires too much
computation per time-step when online updates are needed and the number of state features
becomes large.

In system identification, adaptive filtering and adaptive control, the recursive least-squares
(RLS) method (Young, 1984; Ljung, 1983; Ljung, 1977), commonly used to reduce the
computational burden of least-squares methods, is more suitable for online estimation and control.
Although RLS-TD(0) makes use of RLS methods, it does not employ the mechanism of
eligibility traces. Based on the work of Tsitsiklis and Roy (1994, 1997) and Boyan (1999, 2002), and
motivated by the above ideas, a new class of temporal-difference learning methods, called the
RLS-TD(λ) algorithm, is proposed and analyzed formally in this paper. RLS-TD(λ) is superior
to conventional linear TD(λ) algorithms in that it makes use of RLS methods to improve the
learning efficiency from a statistical point of view and eliminates the step-size schedules.
RLS-TD(λ) has the mechanism of eligibility traces and can be viewed as the extension of
RLS-TD(0) from λ = 0 to general 0 < λ < 1. The convergence with probability 1 of RLS-TD(λ) is
proved for ergodic Markov chains and the limit of convergence is also analyzed. In learning
prediction experiments for Markov chains, the performance of RLS-TD(λ) and TD(λ) as well as
LS-TD(λ) is compared, where a wide range of parameter settings is tested. In addition, the
influence of the initialization parameters in RLS-TD(λ) is also discussed. It is observed that the
rate of convergence is influenced by the initialization of the variance matrix, which is a
phenomenon investigated theoretically in adaptive filtering (Moustakides, 1997; Haykin, 1996).

As will be analyzed in the following sections, there are two benefits of the extension from
RLS-TD(0) to RLS-TD(λ). One is that the value of λ (0 < λ < 1) will still affect the performance
of the RLS-based temporal-difference algorithms. Although for RLS-TD(λ), the rate of
convergence is mainly influenced by the initialization of the variance matrix, the bound of
approximation error is dominantly determined by the parameter λ. The smallest error bound can
be obtained for λ = 1 and the worst bound is obtained for λ = 0. These bounds suggest that the
value of λ should be selected appropriately to obtain the best approximation error. The second
benefit is that RLS-TD(λ) is more suitable for online learning than LS-TD(λ) since the
computation per time-step is reduced from O(K^3) to O(K^2), where K is the number of state
features.

The Adaptive-Heuristic-Critic (AHC) learning algorithm is a class of reinforcement learning
methods that has an actor-critic architecture and can be used to solve full reinforcement learning
or learning control problems. By applying the RLS-TD(λ) algorithm in the critic, the Fast-AHC
algorithm is proposed in this paper. Using RLS methods in the critic, the performance of learning
prediction in the critic is improved so that learning control problems can be solved more
efficiently. Simulation experiments on the learning control of the cart-pole balancing problem and
the swing-up of an acrobot are conducted to verify the effectiveness of the Fast-AHC method. By
comparing with conventional AHC methods which use TD(λ) in the critic, it is demonstrated that
Fast-AHC can obtain higher data efficiency than conventional AHC methods. Experiments on the
performance comparisons between AHC methods using LS-TD(λ) and Fast-AHC are also
conducted. In the learning control experiments, it is also illustrated that the initializing constant of
the variance matrix in RLS-TD(λ) influences the performance of Fast-AHC and different values
of the constant should be selected to get better performance in different problems. The above
results are analyzed based on the theoretical work on the transient phase of RLS methods.

This paper is organized as follows. In Section 2, an introduction to the previous linear
temporal-difference algorithms is presented. In Section 3, the RLS-TD(λ) algorithm is proposed
and its convergence (with probability one) is proved. In Section 4, a simulation example of the
value-function prediction for absorbing Markov chains is presented to illustrate the effectiveness
of the RLS-TD(λ) algorithm, where different parameter settings for different algorithms
including LS-TD(λ) are studied. In Section 5, the Fast-AHC method is proposed and the
simulation experiments on the learning control of the cart-pole balancing and the acrobot are
conducted to compare Fast-AHC with the conventional AHC method as well as the
LS-TD(λ)-based AHC method. Some simulation results are presented and analyzed in detail. The
last section contains concluding remarks and directions for future work.

2. Previous Work on Linear Temporal-Difference Algorithms 

In this section, a brief discussion of the conventional linear TD(λ) algorithm and RLS-TD(0) as
well as the LS-TD(λ) algorithm will be given. First of all, some mathematical notations are
presented as follows.

Consider a Markov chain whose states lie in a finite or countably infinite space S. The states
of the Markov chain can be indexed as {1, 2, ..., n}, where n is possibly infinite. Although the
algorithms and the results in this paper are applicable to Markov chains with a general state space,
the discussion in this paper will be restricted to the case of a countable state space to
simplify the notation. The extension to Markov chains with a general state space only requires the
translation of the matrix notation into operator notation.

Let the trajectory generated by the Markov chain be denoted by {x_t | t = 0, 1, 2, ...; x_t ∈ S}. The
dynamics of the Markov chain are described by a transition probability matrix P whose (i, j)-th
entry, denoted by p_ij, is the transition probability for x_{t+1} = j given that x_t = i. For each state transition
from x_t to x_{t+1}, a scalar reward r_t is defined. The value function of each state is defined as follows:

$$V(i) = E\left\{ \sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, x_0 = i \right\} \qquad (1)$$

where 0 < γ < 1 is a discount factor.

In the TD(λ) algorithm, there are two basic mechanisms, which are the temporal difference
and the eligibility trace, respectively. Temporal differences are defined as the differences between
two successive estimations and have the following form.

$$\delta_t = r_t + \gamma \tilde{V}_t(x_{t+1}) - \tilde{V}_t(x_t) \qquad (2)$$

where x_{t+1} is the successor state of x_t, Ṽ(x) denotes the estimate of the value function V(x), and r_t
is the reward received after the state transition from x_t to x_{t+1}.

The eligibility trace can be viewed as an algebraic trick to improve learning efficiency
without recording all the data of a multi-step prediction process. This trick is based on the idea of
using the truncated return of a Markov chain. In temporal-difference learning with eligibility
traces, an n-step truncated return is defined as

$$R_t^n = r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^n \tilde{V}_t(s_{t+n}) \qquad (3)$$

For an absorbing Markov chain whose length is T, the weighted average of truncated returns
is

$$R_t^\lambda = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R_t^n + \lambda^{T-t-1} R_T \qquad (4)$$

where 0 < λ < 1 is a decaying factor and R_T = r_t + γ r_{t+1} + ... + γ^T r_T is the Monte-Carlo return at
the terminal state. In each step of the TD(λ) algorithm, the update rule of the value function
estimation is determined by the weighted average of truncated returns defined above. The
corresponding update equation is

$$\Delta \tilde{V}_t(s_i) = \alpha_t \left(R_t^\lambda - \tilde{V}_t(s_i)\right) \qquad (5)$$

where α_t is a learning factor.
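
As a worked illustration of the truncated returns and their weighted average, the following Python sketch computes both for one recorded episode. The function names and the list-based indexing are assumptions made for this example only: rewards[k] stands for r_k, values[k] for the current estimate of the state visited at step k, and the terminal state is taken to be reached right after the last reward.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step truncated return: n discounted rewards plus a bootstrapped value.

    Assumes t + n < len(values), i.e. the bootstrapped state is not terminal.
    """
    ret = sum(gamma ** k * rewards[t + k] for k in range(n))
    return ret + gamma ** n * values[t + n]


def lambda_return(rewards, values, t, gamma, lam):
    """Weighted average of the truncated returns, closed by the Monte-Carlo return."""
    T = len(rewards)  # the episode ends after T rewards
    # Monte-Carlo return from time t: all remaining discounted rewards.
    mc = sum(gamma ** k * rewards[t + k] for k in range(T - t))
    weighted = sum(
        (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
        for n in range(1, T - t)
    )
    return weighted + lam ** (T - t - 1) * mc


# Hypothetical three-step episode with zero value estimates.
example = lambda_return([-3.0, -3.0, -2.0], [0.0, 0.0, 0.0], 0, 1.0, 0.5)
```

With zero value estimates, rewards (-3, -3, -2), γ = 1 and λ = 0.5, the three terms are 0.5·(-3) + 0.25·(-6) + 0.25·(-8) = -5. Setting λ = 0 leaves only the one-step return, and λ = 1 leaves only the Monte-Carlo return.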

The update equation (5) can be used only after the whole trajectory of the Markov chain is
observed. To realize incremental or online learning, eligibility traces are defined for each state as
follows:

$$z_{t+1}(s_i) = \begin{cases} \gamma\lambda z_t(s_i) + 1, & \text{if } s_i = s_t \\ \gamma\lambda z_t(s_i), & \text{otherwise} \end{cases} \qquad (6)$$

The online TD(λ) update rule with eligibility traces is

$$\tilde{V}_{t+1}(s_i) = \tilde{V}_t(s_i) + \alpha_t \delta_t z_{t+1}(s_i) \qquad (7)$$

where δ_t is the temporal difference at time step t, which is defined in (2), and z_0(s) = 0 for all s.

Since the state space of a Markov chain is usually large or infinite in practice, function
approximators such as neural networks are commonly used to approximate the value function.
TD(λ) algorithms with linear function approximators are the most popular and well-studied ones.

Consider a general linear function approximator with a fixed basis function vector

$$\phi(x) = (\phi_1(x), \phi_2(x), \dots, \phi_n(x))^T$$

The estimated value function can be denoted as

$$\tilde{V}_t(x) = \phi^T(x) W_t \qquad (8)$$

where W_t = (w_1, w_2, ..., w_n)^T is the weight vector.

The corresponding incremental weight update rule is

$$W_{t+1} = W_t + \alpha_t \left(r_t + \gamma \phi^T(x_{t+1}) W_t - \phi^T(x_t) W_t\right) z_{t+1} \qquad (9)$$

where the eligibility trace vector z_t(s) = (z_{1t}(s), z_{2t}(s), ..., z_{nt}(s))^T is defined as

$$z_{t+1} = \gamma\lambda z_t + \phi(x_t) \qquad (10)$$
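
The incremental rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the plain-list vectors and the argument order are choices made for this example, and the trace is decayed and incremented with the current state's features before the weights are moved.

```python
def dot(u, v):
    """Inner product of two equally long lists."""
    return sum(a * b for a, b in zip(u, v))


def linear_td_lambda_step(w, z, phi_t, phi_next, reward, alpha, gamma, lam):
    """One online linear TD(lambda) update; returns (new_w, new_z)."""
    # Trace update: decay the old trace and add the current state's features.
    z_new = [gamma * lam * zi + fi for zi, fi in zip(z, phi_t)]
    # Temporal difference computed with the linear value estimate phi^T w.
    delta = reward + gamma * dot(phi_next, w) - dot(phi_t, w)
    # Weight update: step along the trace, scaled by the step size and the TD error.
    w_new = [wi + alpha * delta * zi for wi, zi in zip(w, z_new)]
    return w_new, z_new


# Two hypothetical transitions with one-hot features; the second one ends in a
# terminal state whose feature vector is taken to be zero.
w, z = [0.0, 0.0], [0.0, 0.0]
w, z = linear_td_lambda_step(w, z, [1.0, 0.0], [0.0, 1.0], -3.0, 0.1, 1.0, 0.5)
w, z = linear_td_lambda_step(w, z, [0.0, 1.0], [0.0, 0.0], -2.0, 0.1, 1.0, 0.5)
```

With one-hot features this reduces to the tabular rule: after the second transition the first weight is also adjusted, because its trace still holds the decayed value γλ = 0.5.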

In Tsitsiklis and Roy (1997), the above linear TD(λ) algorithm is proved to converge with
probability 1 under certain assumptions and the limit of convergence W* is also derived, which
satisfies the following equation.

$$E_0[A(X_t)] W^* - E_0[b(X_t)] = 0 \qquad (11)$$

where X_t = (x_t, x_{t+1}, z_{t+1}) (t = 1, 2, ...) form a Markov process, E_0[·] stands for the expectation with
respect to the unique invariant distribution of {X_t}, and A(X_t) and b(X_t) are defined as

$$A(X_t) = z_t \left(\phi^T(x_t) - \gamma \phi^T(x_{t+1})\right) \qquad (12)$$

$$b(X_t) = z_t r_t \qquad (13)$$

To improve the efficiency of linear TD(λ) algorithms, least-squares methods are used with the
linear TD(0) algorithm, and the LS-TD(0) and RLS-TD(0) algorithms are suggested (Bradtke and
Barto, 1996). In LS-TD(0) and RLS-TD(0), the following quadratic objective function is defined.

$$J = \sum_{t} \left[r_t - (\phi_t^T - \gamma \phi_{t+1}^T) W\right]^2 \qquad (14)$$

Thus, the aim of LS-TD(0) and RLS-TD(0) is to obtain a least-squares estimation of the real
value function which satisfies the following Bellman equation.

$$V(x_t) = E[r_t(x_t, x_{t+1}) + \gamma V(x_{t+1})] \qquad (15)$$

By employing the instrumental variables approach (Soderstrom and Stoica, 1983), the
least-squares solution of (14) is given as

$$W_{LS\text{-}TD(0)} = \left(\sum_{t} \phi_t (\phi_t - \gamma \phi_{t+1})^T\right)^{-1} \left(\sum_{t} \phi_t r_t\right) \qquad (16)$$

where φ_t is the instrumental variable, chosen to be uncorrelated with the input and output noises.

In RLS-TD(0), recursive least-squares methods are used to decrease the computational
burden of LS-TD(0). The update rules of RLS-TD(0) are as follows:

$$W_{t+1} = W_t + \frac{P_t \phi_t \left(r_t - (\phi_t - \gamma \phi_{t+1})^T W_t\right)}{1 + (\phi_t - \gamma \phi_{t+1})^T P_t \phi_t} \qquad (17)$$

$$P_{t+1} = P_t - \frac{P_t \phi_t (\phi_t - \gamma \phi_{t+1})^T P_t}{1 + (\phi_t - \gamma \phi_{t+1})^T P_t \phi_t} \qquad (18)$$

The convergence (with probability one) of LS-TD(0) and RLS-TD(0) is proved for periodic and
absorbing Markov chains under certain assumptions (Bradtke and Barto, 1996).

In Boyan (1999, 2002), LS-TD(λ) is proposed by solving (11) directly and the model-based
property of LS-TD(λ) is also analyzed. However, for LS-TD(λ), the computation per time-step is
O(K^3), i.e., cubic in the number of state features. Therefore the computation required by
LS-TD(λ) increases very fast when K increases, which is undesirable for online learning.

In the next section, we propose the RLS-TD(λ) algorithm by making use of recursive
least-squares methods so that the computational burden of LS-TD(λ) can be reduced from O(K^3)
to O(K^2). We also give a rigorous mathematical analysis of the algorithm, where the convergence
(with probability 1) of RLS-TD(λ) is proved.

3. The RLS-TD(λ) Algorithm

For the Markov chain discussed above, when linear function approximators are used, the
least-squares estimation problem of (11) has the following objective function.

$$J = \left\| \sum_{t=0}^{T} \left(A(X_t) W - b(X_t)\right) \right\|^2 \qquad (19)$$

where A(X_t) ∈ R^{n×n} and b(X_t) ∈ R^n are defined as in (12) and (13), respectively, ||·|| is the Euclidean norm
and n is the number of basis functions.

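In LS-TD(λ), the least-squares estimate of the weight vector W is computed according to the
following equation.

$$W_{LS\text{-}TD(\lambda)} = A_T^{-1} b_T = \left(\sum_{t=0}^{T} A(X_t)\right)^{-1} \left(\sum_{t=0}^{T} b(X_t)\right) \qquad (20)$$

where

$$A_T = \sum_{t=0}^{T} A(X_t) = \sum_{t=0}^{T} z_t \left(\phi^T(x_t) - \gamma \phi^T(x_{t+1})\right) \qquad (21)$$

$$b_T = \sum_{t=0}^{T} b(X_t) = \sum_{t=0}^{T} z_t r_t \qquad (22)$$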
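
The batch computation can be sketched as follows: accumulate the matrix A_T and the vector b_T over the observed transitions, then solve the linear system once. This is a minimal pure-Python illustration, not Boyan's or the authors' code. The names `solve` and `lstd_lambda`, the tuple layout of a transition, and the convention that the trace already includes the current state's features are assumptions of this example.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; more data or other features are needed")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def lstd_lambda(transitions, n, gamma, lam):
    """transitions: iterable of (phi_t, reward, phi_next, done) tuples."""
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    z = [0.0] * n
    for phi_t, reward, phi_next, done in transitions:
        z = [gamma * lam * zi + fi for zi, fi in zip(z, phi_t)]    # eligibility trace
        d = [ft - gamma * fn for ft, fn in zip(phi_t, phi_next)]    # phi(x_t) - gamma phi(x_{t+1})
        for i in range(n):
            b[i] += z[i] * reward
            for j in range(n):
                A[i][j] += z[i] * d[j]
        if done:
            z = [0.0] * n  # restart the trace at the end of a trajectory
    return solve(A, b)


# Hypothetical deterministic two-step episode with one-hot features.
episode = [([1.0, 0.0], -3.0, [0.0, 1.0], False),
           ([0.0, 1.0], -2.0, [0.0, 0.0], True)]
weights = lstd_lambda(episode, 2, 1.0, 0.5)
```

For this deterministic episode the solution is the exact value function, (-5, -2), whatever λ is used. The elimination step costs O(n^3), which is the cubic per-solve cost that motivates a recursive form.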



As is well known in system identification, adaptive filtering and control, RLS methods are
commonly used to solve the computational and memory problems of least-squares algorithms. In
the sequel, we present the RLS-TD(λ) algorithm based on the above idea. First, the matrix
inverse lemma is given as follows:

Lemma 1 (Ljung et al., 1983). If A ∈ R^{n×n}, B ∈ R^{n×1}, C ∈ R^{1×n} and A is invertible, then

$$(A + BC)^{-1} = A^{-1} - A^{-1} B (I + C A^{-1} B)^{-1} C A^{-1} \qquad (23)$$
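
A quick numerical check of the lemma for the rank-one case (column vector B, row vector C) is sketched below. The helper names `inv2` and `rank_one_inverse_update` and the 2×2 test matrix are hypothetical choices for this illustration.

```python
def inv2(m):
    """Direct inverse of a 2x2 matrix, used only as a reference."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def rank_one_inverse_update(a_inv, b_col, c_row):
    """Inverse of (A + B C) obtained from A^{-1} without a new matrix inversion."""
    n = len(a_inv)
    a_inv_b = [sum(a_inv[i][k] * b_col[k] for k in range(n)) for i in range(n)]
    c_a_inv = [sum(c_row[k] * a_inv[k][j] for k in range(n)) for j in range(n)]
    # For a column B and a row C, I + C A^{-1} B is a scalar.
    denom = 1.0 + sum(c_row[k] * a_inv_b[k] for k in range(n))
    return [[a_inv[i][j] - a_inv_b[i] * c_a_inv[j] / denom for j in range(n)]
            for i in range(n)]


A = [[2.0, 1.0], [0.0, 4.0]]
B = [1.0, 2.0]
C = [3.0, -1.0]
A_plus_BC = [[A[i][j] + B[i] * C[j] for j in range(2)] for i in range(2)]
via_lemma = rank_one_inverse_update(inv2(A), B, C)
direct = inv2(A_plus_BC)
```

Both routes give [[0.2, 0], [-0.6, 0.5]] for this example. The lemma route needs only matrix-vector products, i.e. O(n^2) work per update.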



Let

$$P_t = A_t^{-1} \qquad (24)$$

$$P_0 = \delta I \qquad (25)$$

$$K_{t+1} = P_{t+1} z_t \qquad (26)$$

where δ is a positive number and I is the identity matrix.

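Then the weight update rules of RLS-TD(λ) are given by

$$K_{t+1} = \frac{P_t z_t}{\mu + \left(\phi^T(x_t) - \gamma \phi^T(x_{t+1})\right) P_t z_t} \qquad (27)$$

$$W_{t+1} = W_t + K_{t+1} \left(r_t - \left(\phi^T(x_t) - \gamma \phi^T(x_{t+1})\right) W_t\right) \qquad (28)$$

$$P_{t+1} = \frac{1}{\mu} \left[P_t - P_t z_t \left[\mu + \left(\phi^T(x_t) - \gamma \phi^T(x_{t+1})\right) P_t z_t\right]^{-1} \left(\phi^T(x_t) - \gamma \phi^T(x_{t+1})\right) P_t\right] \qquad (29)$$

where μ = 1 for the standard RLS-TD(λ) algorithm and 0 < μ < 1 for the general forgetting factor
RLS-TD(λ) case.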

The forgetting factor μ (0 < μ < 1) is usually used in adaptive filtering to improve the
performance of RLS methods in non-stationary environments. The forgetting factor RLS-TD(λ)
algorithm with 0 < μ < 1 can be derived using techniques similar to those in Haykin (1996). The detailed
derivation of RLS-TD(λ) is given in Appendix A.
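
A single application of the three update rules (gain, weights, variance matrix) can be sketched in Python as below. The function name and the list-of-lists matrix representation are assumptions of this example, and the caller is assumed to pass the eligibility trace that should be paired with the current transition. Each step uses only matrix-vector products, so the cost is quadratic in the number of features.

```python
def rls_td_lambda_update(w, p, z, phi_t, phi_next, reward, gamma, mu=1.0):
    """Apply the gain, weight and variance-matrix updates once; returns (new_w, new_p)."""
    n = len(w)
    d = [ft - gamma * fn for ft, fn in zip(phi_t, phi_next)]         # phi(x_t) - gamma phi(x_{t+1})
    pz = [sum(p[i][k] * z[k] for k in range(n)) for i in range(n)]   # P_t z_t
    dp = [sum(d[k] * p[k][j] for k in range(n)) for j in range(n)]   # d^T P_t
    denom = mu + sum(d[k] * pz[k] for k in range(n))
    gain = [v / denom for v in pz]                                    # gain vector K_{t+1}
    err = reward - sum(d[k] * w[k] for k in range(n))                 # equation residual
    w_new = [w[i] + gain[i] * err for i in range(n)]
    p_new = [[(p[i][j] - pz[i] * dp[j] / denom) / mu for j in range(n)]
             for i in range(n)]
    return w_new, p_new


# Scalar example: one feature, P_0 = 100, a transition into a zero-feature state.
w1, p1 = rls_td_lambda_update([0.0], [[100.0]], [1.0], [1.0], [0.0], -2.0, 1.0)
```

In the scalar example the weight moves from 0 to -200/101, almost all the way to the observed reward of -2, because a large initial P acts like a very weak prior on the weights.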

In the following, RLS-TD(λ) is described for two different kinds of Markov chains.
First, a complete description of RLS-TD(λ) for ergodic Markov chains is presented below.



Algorithm 1 RLS-TD(λ) for ergodic Markov chains

1: Given:
   • A termination criterion for the algorithm.
   • A set of basis functions {φ_j(i)} (j = 1, 2, ..., n) for each state i, where n is the
     number of basis functions.
2: Initialize:
   (2.1) Let t = 0.
   (2.2) Initialize the weight vector W_t, the variance matrix P_t, and the initial state x_0.
   (2.3) Set the eligibility trace vector z_0 = 0.
3: Loop:
   (3.1) For the current state x_t, observe the state transition from x_t to x_{t+1} and the
         reward r(x_t, x_{t+1}).
   (3.2) Apply equations (27)-(29) to update the weight vector.
   (3.3) t = t + 1.
   until the termination criterion is satisfied.



The RLS-TD(λ) algorithm for absorbing Markov chains differs slightly from the above
algorithm in how it handles the state features of absorbing states. The following is a description of
RLS-TD(λ) for absorbing Markov chains.



Algorithm 2 RLS-TD(λ) for absorbing Markov chains

1: Given:
   • A termination criterion for the algorithm.
   • A set of basis functions {φ_j(i)} (j = 1, 2, ..., n) for each state i, where n is the
     number of basis functions.
2: Initialize:
   (2.1) Let t = 0.
   (2.2) Initialize the weight vector W_t, the variance matrix P_t, and the initial state x_0.
   (2.3) Set the eligibility trace vector z_0 = 0.
3: Loop:
   (3.1) For the current state x_t,
         • If x_t is an absorbing state, set φ(x_{t+1}) = 0 and r(x_t) = r_T, where r_T is the terminal
           reward.
         • Otherwise, observe the state transition from x_t to x_{t+1} and the reward
           r(x_t, x_{t+1}).
   (3.2) Apply equations (27)-(29) to update the weight vector.
   (3.3) If x_t is an absorbing state, re-initialize the process by setting x_{t+1} to an initial
         state and set the eligibility trace z_t to a zero vector.
   (3.4) t = t + 1.
   until the termination criterion is satisfied.
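
The loop for absorbing chains can be sketched end to end on a toy problem. Everything specific in this Python example is an assumption made for illustration: the deterministic three-state chain 2 → 1 → 0, the one-hot features, the dictionary-based model, a fixed number of episodes as the termination criterion, and the choice to add the current state's features to the trace before the update is applied.

```python
def rls_td_lambda_absorbing(features, next_state, reward, absorbing, start,
                            terminal_reward, gamma, lam, delta, episodes, mu=1.0):
    """RLS-TD(lambda) on a deterministic absorbing chain (illustrative sketch)."""
    n = len(features[start])
    w = [0.0] * n
    p = [[delta if i == j else 0.0 for j in range(n)] for i in range(n)]  # P_0 = delta * I
    for _ in range(episodes):
        x, z, done = start, [0.0] * n, False  # restart with a zero trace
        while not done:
            phi = features[x]
            if x == absorbing:
                # Absorbing state: zero next features and the terminal reward.
                phi_next, r, done = [0.0] * n, terminal_reward, True
            else:
                x_next = next_state[x]
                phi_next, r = features[x_next], reward[x]
            z = [gamma * lam * zi + fi for zi, fi in zip(z, phi)]
            d = [a - gamma * b for a, b in zip(phi, phi_next)]
            pz = [sum(p[i][k] * z[k] for k in range(n)) for i in range(n)]
            dp = [sum(d[k] * p[k][j] for k in range(n)) for j in range(n)]
            denom = mu + sum(d[k] * pz[k] for k in range(n))
            err = r - sum(d[k] * w[k] for k in range(n))
            w = [w[i] + pz[i] / denom * err for i in range(n)]
            p = [[(p[i][j] - pz[i] * dp[j] / denom) / mu for j in range(n)]
                 for i in range(n)]
            if not done:
                x = x_next
    return w


features = {2: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 0: [0.0, 0.0, 1.0]}
weights = rls_td_lambda_absorbing(features, {2: 1, 1: 0}, {2: -1.0, 1: -1.0},
                                  absorbing=0, start=2, terminal_reward=0.0,
                                  gamma=1.0, lam=0.5, delta=1000.0, episodes=20)
```

With rewards of -1 per transition and a terminal reward of 0, the true values of states 2, 1 and 0 are -2, -1 and 0, and the learned weights approach them closely within a few episodes when δ is large.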



In the above RLS-TD(λ) algorithm for absorbing Markov chains, the weight updates in the
absorbing states are treated differently and the process is re-initialized in absorbing states to
transform the absorbing Markov chain into an equivalent ergodic Markov chain. So in the
following convergence analysis, we only focus on ergodic Markov chains.

Under assumptions similar to those in Tsitsiklis and Roy (1997), we will prove that the proposed
RLS-TD(λ) algorithm converges with probability one.

Assumption 1. The Markov chain {x_t}, whose transition probability matrix is P, is ergodic, and
there is a unique distribution π that satisfies

$$\pi^T P = \pi^T \qquad (30)$$

with π(i) > 0 for all i ∈ S, where π is a finite or infinite vector, depending on the cardinality of S.

Assumption 2. Transition rewards r(x_t, x_{t+1}) satisfy

$$E_0[r^2(x_t, x_{t+1})] < \infty \qquad (31)$$

where E_0[·] is the expectation with respect to the distribution π.

Assumption 3. The matrix Φ = [φ_1, φ_2, ..., φ_n] ∈ R^{N×n} has full column rank, that is, the basis
functions φ_i (i = 1, 2, ..., n) are linearly independent.

Assumption 4. For every i (i = 1, 2, ..., n), the basis function φ_i satisfies

$$E_0[\phi_i^2(x_t)] < \infty \qquad (32)$$

Assumption 5. The matrix [P_0^{-1} + (1/T) Σ_{t=1}^{T} A(X_t)] is non-singular for all T > 0.

Assumptions 1-4 are almost the same as those for the linear TD(λ) algorithms discussed in
Tsitsiklis and Roy (1997) except that in Assumption 1, ergodic Markov chains are considered.
Assumption 5 is specially needed for the convergence of the RLS-TD(λ) algorithm.

Based on the above assumptions, the convergence theorem for RLS-TD(λ) can be given as
follows:

Theorem 1. For a Markov chain which satisfies Assumptions 1-5, the asymptotic estimate found
by RLS-TD(λ) converges, with probability 1, to W* determined by (11).

For the proof of Theorem 1, please refer to Appendix B. The condition specified by
Assumption 5 can be satisfied by setting P_0 = δI appropriately.

According to Theorem 1, RLS-TD(λ) converges to the same solution as conventional linear
TD(λ) algorithms do, which satisfies (11). So the limit of convergence can be characterized by
the following theorem.

Theorem 2 (Tsitsiklis and Roy, 1997). Let W* be the weight vector determined by (11) and V* be
the true value function of the Markov chain. Then, under Assumptions 1-4, the following relation
holds:

$$\|\Phi W^* - V^*\|_D \le \frac{1 - \lambda\gamma}{1 - \gamma} \|\Pi V^* - V^*\|_D \qquad (33)$$

where ||X||_D = sqrt(X^T D X) and Π = Φ(Φ^T D Φ)^{-1} Φ^T D.

For more explanations on the notations in Theorem 2, please refer to Appendix B.

As discussed by Tsitsiklis and Roy (1997), the above theorem shows that the distance of the
limiting function ΦW* from the true value function V* is bounded and the smallest bound of
approximation error can be obtained when λ = 1. For every λ < 1, the bound actually deteriorates as
λ decreases. The worst bound is obtained when λ = 0. Although this is only a bound, it strongly
suggests that higher values of λ are likely to produce more accurate approximations of V*.
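
To make the dependence on λ concrete, the short Python sketch below evaluates the multiplicative factor (1 - λγ)/(1 - γ) that scales the projection error in the bound of Tsitsiklis and Roy (1997). The function name and the discount factor γ = 0.9 are assumptions chosen for this illustration.

```python
def error_bound_factor(lam, gamma):
    """Factor (1 - lam * gamma) / (1 - gamma) multiplying the projection error."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("the factor is finite only for 0 <= gamma < 1")
    return (1.0 - lam * gamma) / (1.0 - gamma)


# Factors for lambda = 0, 0.1, ..., 1 with an assumed discount factor of 0.9.
factors = [error_bound_factor(0.1 * n, 0.9) for n in range(11)]
```

For γ = 0.9 the factor falls from about 10 at λ = 0 to 1 at λ = 1, so the guaranteed accuracy improves steadily as λ grows.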

Compared to LS-TD(λ), there is an additional parameter in RLS-TD(λ), which is the value δ
for the initial variance matrix P_0. As was pointed out by Haykin (1996, p. 570), the exact value of
the initializing constant δ has an insignificant effect when the data length is large enough. This
means that in the limit, the final solutions obtained by LS and RLS are almost the same. As for the
influence of δ on the transient phase, when the positive constant δ becomes large enough or goes
to infinity, the transient behavior of RLS will be almost the same as that of LS methods (Ljung,
1983). But when δ is initialized with a relatively small value, the transient phases of RLS and LS
will be different. In practice, it is observed that the performance of RLS varies as a
function of the initialization of δ (Moustakides, 1997). In some cases, RLS can exhibit a
significantly faster convergence when initialized with a relatively small positive definite matrix
than when initialized with a large one (Haykin, 1996; Moustakides, 1997; Hubing and Alexander,
1989). A first effort in this direction was the statistical analysis of RLS for soft and exact
initialization, but it is limited to the case where the number of iterations is less than the size of the
estimation vector (Hubing and Alexander, 1989). Moustakides (1997) provided a theoretical
analysis of the relation between the algorithmic performance of RLS and the initialization of δ.
By using the settling time as the performance measure, Moustakides proved that the well-known
rule of initialization with a relatively small matrix is preferable for cases of high and medium
signal-to-noise ratio (SNR), whereas for low SNR, a relatively large matrix must be selected for
achieving best results. In the following learning prediction experiments of RLS-TD(λ), as well as
the learning control simulation of Fast-AHC, it is observed that the value of the initializing
constant δ also plays an important role in the convergence performance, and the above theoretical
analyses provide a clue to explain our experimental results.

4. Learning Prediction Experiments on Markov Chains 

In this section, an illustrative example is given to show the effectiveness of the proposed
RLS-TD(λ) algorithm. Furthermore, the algorithmic performance under the influence of the
initializing constant δ is studied.

The example is a finite-state absorbing Markov chain called the Hop-World problem (Boyan,
1999). As shown in Figure 1, the Hop-World problem is a 13-state Markov chain with an
absorbing state.



Figure 1: The Hop-World Problem (transition rewards shown: -3; state features shown:
[1,0,0,0], [3/4,1/4,0,0], [1/2,1/2,0,0], ..., [0,0,1/2,1/2], [0,0,1/4,3/4], [0,0,0,1])



In Figure 1, state 12 is the initial state for each trajectory and state 0 is the absorbing state.
Each non-absorbing state has two possible state transitions with transition probability 0.5. Each
state transition has reward -3 except the transition from state 1 to state 0, which has a reward of -2.
Thus, the true value function for state i (0 ≤ i ≤ 12) is -2i.

To apply linear temporal-difference algorithms to the value function prediction problem, a set
of four-element state features or basis functions is chosen, as shown in Figure 1. The state
features of states 12, 8, 4 and 0 are, respectively, [1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1] and the
state features of other states are obtained by linearly interpolating between these.
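
The feature construction and the stated value function can be checked with a short Python sketch. Two points are assumptions of this example rather than statements from the text: state i ≥ 2 is taken to move to i-1 or i-2 with probability 0.5 each and state 1 to move only to state 0 (the Hop-World layout described by Boyan, 1999), and the problem is treated as undiscounted (γ = 1), which is what reproduces the values -2i.

```python
def hop_world_features(state):
    """Four-element features, linearly interpolated between states 12, 8, 4 and 0."""
    phi = [0.0, 0.0, 0.0, 0.0]
    segment, offset = divmod(12 - state, 4)  # anchor pair and position between them
    phi[segment] = 1.0 - offset / 4.0
    if offset:
        phi[segment + 1] = offset / 4.0
    return phi


def hop_world_true_values(gamma=1.0):
    """Exact values by backward recursion under the assumed transition layout."""
    v = [0.0] * 13                      # v[0] = 0 for the absorbing state
    v[1] = -2.0 + gamma * v[0]          # single transition 1 -> 0 with reward -2
    for i in range(2, 13):
        v[i] = -3.0 + gamma * 0.5 * (v[i - 1] + v[i - 2])
    return v


values = hop_world_true_values()
# Weights that reproduce -2i through the interpolated features.
exact_weights = [-24.0, -16.0, -8.0, 0.0]
approx = [sum(f * w for f, w in zip(hop_world_features(i), exact_weights))
          for i in range(13)]
```

Because -2i is linear in i and the features interpolate linearly, the weight vector (-24, -16, -8, 0) represents the true value function exactly, so any remaining prediction error in this problem comes from estimation rather than from the feature set.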

In our simulation, the RLS-TD(λ) algorithm as well as LS-TD(λ) and conventional linear
TD(λ) algorithms are used to solve the above value function prediction problem without
knowing the model of the Markov chain. In the experiments, a trial is defined as the period from
the initial state 12 to the terminal state 0. The performance of the algorithms is evaluated by the
averaged root mean squared (RMS) error of value-function predictions over all the 13 states. For
each parameter setting, the performance is averaged over 20 independent Monte-Carlo runs.
Figure 2 shows the learning curves of RLS-TD(λ) and conventional linear TD(λ) algorithms with
three different parameter settings. The parameter λ is set to 0.3 for all the algorithms and the
step-size parameter of TD(λ) has the following form.

$$\alpha_n = \alpha_0 \frac{N_0 + 1}{N_0 + n} \qquad (34)$$

The above step-size schedule is also studied in Boyan (1999). In our experiments, three
different settings are used, which are

(s1) α_0 = 0.01, N_0 = 10^6
(s2) α_0 = 0.01, N_0 = 1000          (35)
(s3) α_0 = 0.1, N_0 = 1000.
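
The three schedules can be compared numerically with the sketch below. It assumes the schedule has the form used by Boyan (1999), α_n = α_0 (N_0 + 1)/(N_0 + n), with n counted from 1; the function name and the dictionary of settings are hypothetical.

```python
def step_size(n, alpha0, n0):
    """Step size for the n-th state transition (n = 1, 2, ...), assumed Boyan form."""
    return alpha0 * (n0 + 1.0) / (n0 + n)


# The three settings (alpha_0, N_0) used in the experiments.
SCHEDULES = {"s1": (0.01, 1e6), "s2": (0.01, 1000.0), "s3": (0.1, 1000.0)}

# Step sizes after 1, 1000 and 100000 state transitions for each schedule.
table = {name: [step_size(n, a0, n0) for n in (1, 1000, 100000)]
         for name, (a0, n0) in SCHEDULES.items()}
```

Under this form, (s1) keeps the step size almost constant over the first few thousand transitions, whereas (s2) and (s3) have roughly halved it after 1000 transitions.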

Unlike those in Boyan (1999), the linear TD(λ) algorithms applied here are in their
online forms, which update the weights after every state transition. So the parameter n in (34) is
the number of state transitions. In each run, the weights are all initialized to zero. In Figure 2,
the learning curves of conventional linear TD(λ) algorithms with step-size schedules (s1), (s2)
and (s3) are shown by curves 1, 2 and 3, respectively. For each curve, the averaged RMS errors of
value function predictions over all the states and 20 independent runs are plotted for each trial.
Curve 4 shows the learning performance of RLS-TD(λ). One additional parameter for RLS-TD(λ)
is the initial value δ of the variance matrix P_0. In this experiment, δ is set to 500, which is a
relatively large value. From Figure 2, it can be concluded that by making use of RLS methods,
RLS-TD(λ) can obtain much better performance than conventional linear TD(λ) algorithms and
eliminates the design problem of the step-size schedules. Other experiments for linear TD(λ) and
RLS-TD(λ) with different values of λ were also conducted; similar results were obtained when
the initial values δ of RLS-TD(λ) were large, which confirms the conclusion.



Figure 2: Performance comparison between RLS-TD(λ) and TD(λ) 
(vertical axis: RMS error of value function prediction; horizontal axis: trials) 

1, 2, 3: TD(0.3) with step-size parameters specified by (s1), (s2) and (s3) 
4: RLS-TD(0.3) with initial variance matrix P_0 = 500I 

We have done demonstrative experiments to investigate the influence of δ on the performance 
of the RLS-TD(λ) algorithm. Figure 3 shows the performance comparison between RLS-TD(λ) 
algorithms using two different initial values of the variance matrix, P_0 = 0.1I and 
P_0 = 1000I, respectively. The forgetting factor is μ = 0.995. The performance of the 
algorithm is measured by the averaged RMS error of the value function prediction in the first 
200 trials over 20 independent runs and all 13 states. In the experiments, 11 settings of the 
parameter λ are tested, which are λ = 0.1n (n = 0, 1, ..., 10). 

In Figure 3, it is clearly shown that the performance of RLS-TD(λ) with a large initial value 
of δ is much better than that of RLS-TD(λ) with a small initial value of δ. In other experiments 
with different settings of λ and δ, similar results are obtained. This phenomenon may be 
attributed to the low SNR case of the forgetting factor RLS studied in Moustakides (1997). 
For the Hop-World problem, the stochastic state transitions could introduce high equation 
residuals A(X_t)W - b(X_t) in (19), which correspond to additive noise with large variance, 
i.e., the low SNR case. As has been discussed in Section 2, for the forgetting factor RLS in low 
SNR cases, a relatively large initializing constant δ must be selected for better results. A full 
understanding of this phenomenon is yet to be found. 

Figure 3: Performance comparison of RLS-TD(λ) with different initial values of δ (μ = 0.995) 
(vertical axis: RMS error of value function prediction; horizontal axis: λ; 
curves: RLS-TD(λ) with δ = 0.1 and RLS-TD(λ) with δ = 1000) 

The performance of RLS-TD(λ) with unit forgetting factor μ = 1 is also tested in our 
experiments. Although the initial value effect in RLS with μ = 1 has not been discussed intensively 
(Moustakides, 1997), the same effects of δ are observed empirically in the case of μ = 1 as in the 
case of μ < 1, which is shown in Figure 4. 

In our other experiments, it is also found that when δ is initialized with a small value, the 
performance is sensitive to the values of δ and the parameter λ. In this case, the convergence 
speed of RLS-TD(λ) increases as λ increases from 0 to 1, which is shown in Figure 3. 
Furthermore, when λ is fixed, the performance of RLS-TD(λ) deteriorates as δ becomes smaller, 
as shown in Figure 5. 

Figure 4: Performance comparison of RLS-TD(λ) with different initial values of δ (μ = 1) 




In Figure 5, the learning curves of RLS-TD(λ) with different initializing constants δ are 
shown and compared with that of LS-TD(λ). In the experiment, λ is set to 0.5. Figure 5 shows 
that the performance of RLS-TD(λ) approaches that of LS-TD(λ) when δ becomes large. 
As is well known, when δ becomes large enough, the performance of RLS and LS methods will 
be almost the same. Figure 6 shows the performance comparison between LS-TD(λ) and 
RLS-TD(λ) with a large value of δ. The initial variance matrix for RLS-TD(λ) is set to 500I in 
every run, where I is the identity matrix. 

Figure 6: Performance comparison of LS-TD(λ) and RLS-TD(λ) with μ = 1 and a large initial 
value of δ 



Based on the above experimental results, it can be concluded that the convergence speed of 
RLS-TD(λ) is mainly influenced by the initial value δ of the variance matrix and the parameter 
λ. Detailed discussions on the properties of RLS-TD(λ) are given as follows: 

(1) When δ is relatively large, the effect of λ becomes small. If δ is large enough or goes to 
infinity, the performance of RLS-TD(λ) and LS-TD(λ) will be almost the same, as was 
discussed above. In such cases, the effect of λ on the speed of convergence is insignificant, 
which coincides with the discussion in Boyan (1999). However, as described in Theorem 2, the 
value of λ still affects the ultimate error bound of value function approximation. 

(2) When δ is relatively small, it is observed that the convergence performance of 
RLS-TD(λ) is different from that of LS-TD(λ) and is influenced by the values of both δ and λ. In 
the experiments of the Hop-World problem, the results show that smaller values of δ lead to 
slower convergence. These results may be explained by the theoretical analysis on the transient 
phase of the forgetting factor RLS (Moustakides, 1997). According to the theory in Moustakides 
(1997), larger values of δ are needed for better performance in the cases of low SNR while 
smaller δ values are preferable for fast convergence in the cases of high and medium SNR. So 
different values of δ must be selected for faster convergence of RLS-TD(λ) in different cases. 
In particular, in some cases, such as the high SNR case discussed in Moustakides (1997), RLS 
methods with small values of δ can obtain a very fast speed of convergence. 

(3) Compared to conventional linear TD(λ) algorithms, the RLS-TD(λ) algorithm can 
obtain much better performance by making use of RLS methods for value function prediction 
problems. Furthermore, in TD(λ), a step-size schedule needs to be carefully designed to achieve 
good performance, while in RLS-TD(λ), the initial value δ of the variance matrix can be selected 
according to the criterion of a "large" or a "small" value. 

(4) Whether LS-TD(λ) or RLS-TD(λ) is preferable depends on the objective. In online 
applications, RLS-TD(λ) has advantages in computational efficiency because the computation 
per step is O(K^2) for RLS-TD(λ) and O(K^3) for LS-TD(λ), where K is the number of state 
features. Moreover, as will be seen later, RLS-TD(λ) can obtain better transient convergence 
performance than LS-TD(λ) in some cases. On the other hand, LS-TD(λ) may be preferable to 
RLS-TD(λ) in long-term convergence performance, as can be seen in Figure 5. From a system 
identification point of view, LS-TD(λ) can obtain unbiased parameter estimates in the face of 
white additive noise, while RLS-TD(λ) with finite δ would exhibit large parameter 
discrepancies. 
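
To make the O(K^2) per-step cost concrete, the following is a minimal sketch of one RLS-TD(λ) update. It assumes the standard forgetting-factor RLS form (gain vector, weight correction and rank-one update of the variance matrix) together with an accumulating eligibility trace; the exact update is the one given in (27)-(29), so this should be read as an illustration under those assumptions rather than as a restatement of those equations. The names `rls_td_step` and `evaluate_chain`, and the two-state chain used as a check, are hypothetical.

```python
import numpy as np


def rls_td_step(w, P, z, phi_s, phi_next, r, gamma, lam, mu=1.0):
    """One RLS-TD(lambda) update; returns (new_w, new_P, new_z).

    Only matrix-vector products and one outer product are used, so the cost
    is O(K^2) for K features. mu is the forgetting factor.
    """
    d = phi_s - gamma * phi_next            # phi(x_t) - gamma * phi(x_{t+1})
    Pz = P @ z
    gain = Pz / (mu + d @ Pz)               # gain vector
    new_w = w + gain * (r - d @ w)          # weight correction
    new_P = (P - np.outer(gain, d @ P)) / mu
    new_z = gamma * lam * z + phi_next      # accumulating eligibility trace
    return new_w, new_P, new_z


def evaluate_chain(episodes=50, delta=1000.0, gamma=0.9, lam=0.0):
    """Toy check: chain s0 -> s1 -> terminal with reward 1 on each transition."""
    features = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0]), None: np.zeros(2)}
    w, P = np.zeros(2), delta * np.eye(2)   # P_0 = delta * I
    for _ in range(episodes):
        z = features[0].copy()              # trace restarts at the initial state
        for s, s_next in ((0, 1), (1, None)):
            w, P, z = rls_td_step(w, P, z, features[s], features[s_next],
                                  1.0, gamma, lam)
    return w


if __name__ == "__main__":
    print(evaluate_chain())                 # close to the true values [1.9, 1.0]
```

In this toy chain the true values are V(s1) = 1 and V(s0) = 1 + 0.9 = 1.9, and with a large δ the estimates approach them closely, in line with the remark that RLS-TD(λ) with a large δ behaves like LS-TD(λ).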

5. The Fast-AHC Algorithm and Two Learning Control Experiments 

In this section, the Fast-AHC algorithm is proposed based on the above results on learning 
prediction to solve learning control problems. Two learning control experiments are conducted to 
illustrate the efficiency of Fast-AHC. 

5.1 The Fast-AHC Algorithm 

The ultimate goal of reinforcement learning is learning control, i.e., to estimate the optimal 
policies or the optimal value functions of Markov decision processes (MDPs). Until now, several 
reinforcement learning control algorithms including Q-learning (Watkins and Dayan, 1992), 
Sarsa-learning (Singh et al., 2000) and the Adaptive Heuristic Critic (AHC) algorithm (Barto, 
Sutton and Anderson, 1983) have been proposed. Among the above methods, the AHC method 
differs from Q-learning and Sarsa-learning, which are value-function-based methods. In the 
AHC method, value functions and policies are separately represented, while in 
value-function-based methods the policies are determined by the value functions directly. There are two 
components in the AHC method, which are called the critic and the actor, respectively. The actor 
is used to generate control actions according to the policies. The critic is used to evaluate the 
policies represented by the actor and provide the actor with internal rewards without waiting for 
delayed external rewards. Since the objective of the critic is policy evaluation or learning 
prediction, temporal-difference learning methods are chosen as the critic's learning algorithms. 
The learning algorithm of the actor is determined by the estimation of the gradient of the policies. 
In the following discussion, a detailed introduction on the AHC method is given. 

Figure 7 shows the architecture of a learning system based on the AHC method. The learning 
system consists of a critic network and an actor network. The inputs of the critic network include 
the external rewards and the state feedback from the environment. The internal rewards provided 
by the critic network are called the temporal-difference (TD) signals. 

As in most reinforcement learning methods, the whole system is modeled as an MDP denoted 
by a tuple {S, A, P, R}, where S is the state set, A is the action set, P is the state transition 
probability and R is the reward function. The policy of the MDP is defined as a function 
π: S → Pr(A), where Pr(A) is a probability distribution in the action space. The objective of the 
AHC method is to estimate the optimal policy π* satisfying the following equation. 

$$J_{\pi^*} = \max_{\pi} J_{\pi} = \max_{\pi} E_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right] \tag{36}$$

where γ is the discount factor, r_t is the reward at time-step t, E_π[·] stands for the expectation 
with respect to the policy π and the state transition probabilities, and J_π is the expected total 
reward. 

Figure 7: The AHC learning system 

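The value function for a stationary policy π and the optimal value function for the optimal 
policy π* are defined as follows: 

$$V^{\pi}(s) = E_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right] \tag{37}$$

$$V^{*}(s) = E_{\pi^*}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right] \tag{38}$$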



According to the theory of dynamic programming, the optimal value function satisfies the 
following Bellman equation. 

$$V^{*}(s) = \max_{a}\left[R(s,a) + \gamma E[V^{*}(s')]\right] \tag{39}$$

where R(s,a) is the expected reward received after taking action a in state s. 

In AHC, the critic uses temporal-difference learning to approximate the value function of the 
current policy. When linear function approximators are used in the critic, the weight update 
equation is 

$$W_{t+1} = W_t + \alpha_t\,[r_t + \gamma V(s_{t+1}) - V(s_t)]\,z_t \tag{40}$$

where z_t is the eligibility trace defined in (10). 
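
The conventional critic update can be sketched as follows. The sketch assumes a linear value function V(s) = φ(s)^T W and an accumulating eligibility trace of the form z_{t+1} = γλ z_t + φ(s_{t+1}); the trace is defined in (10), so that form is an assumption of this example, and the function name `td_lambda_step` is hypothetical.

```python
import numpy as np


def td_lambda_step(w, z, phi_s, phi_next, r, alpha, gamma, lam):
    """One online linear TD(lambda) update; returns (new_w, new_z, td_error)."""
    td_error = r + gamma * (phi_next @ w) - phi_s @ w   # r_t + gamma V(s_{t+1}) - V(s_t)
    new_w = w + alpha * td_error * z                    # weight update along the trace
    new_z = gamma * lam * z + phi_next                  # assumed accumulating trace
    return new_w, new_z, td_error


if __name__ == "__main__":
    w = np.zeros(2)
    phi_s, phi_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
    z = phi_s.copy()                                    # trace starts at phi(s_0)
    print(td_lambda_step(w, z, phi_s, phi_next, r=1.0, alpha=0.5, gamma=0.9, lam=0.5))
```

Each step costs O(K) for K features, which is the computational advantage of conventional TD(λ) over RLS-based critics; the price is the need to choose the step size α_t.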

The action selection policy of the actor is determined by the current state and the value 
function estimation of the critic. Suppose a neural network with weight vector 
u = [u_1, u_2, ..., u_m] is used in the actor, and the output of the actor network is 

$$y_t = f(u, s_t) \tag{41}$$



The actual action outputs ŷ_t of the actor are determined by the following Gaussian 
probability distribution. 

$$p_r(\hat{y}_t) = \exp\left(-\frac{(\hat{y}_t - y_t)^2}{\sigma_t^2}\right) \tag{42}$$



where the mean value is given by (41) and the variance is given by 

$$\sigma_t = \frac{k_1}{1 + \exp(k_2 V(s_t))} \tag{43}$$

In the above equation, k_1 and k_2 are positive constants and V(s_t) is the value function 
estimation of the critic network. 

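To obtain the learning rule of the actor, an estimation of the policy gradient is given as 
follows: 

$$\frac{\partial J_{\pi}}{\partial u} = \frac{\partial J_{\pi}}{\partial y_t}\frac{\partial y_t}{\partial u} \approx \hat{r}_t\,\frac{\hat{y}_t - y_t}{\sigma_t}\,\frac{\partial y_t}{\partial u} \tag{44}$$

where ŷ_t is the actual action and r̂_t is the internal reward or the TD signal provided by the critic: 

$$\hat{r}_t = r_t + \gamma V(s_{t+1}) - V(s_t) \tag{45}$$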
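
The actor-side computations described above can be sketched as follows. The sketch assumes that the exploration width has the form σ_t = k_1/(1 + exp(k_2 V(s_t))), that the actual action is drawn from a normal distribution centred on the actor output with σ_t used as its standard deviation, and that the gradient estimate is the TD signal times (ŷ_t - y_t)/σ_t times ∂y_t/∂u. All function names are hypothetical, and the treatment of σ_t as a standard deviation is a choice made for this example only.

```python
import math
import random


def exploration_width(v, k1=0.4, k2=0.5):
    """sigma_t = k1 / (1 + exp(k2 * V(s_t))); it shrinks as the value estimate grows."""
    return k1 / (1.0 + math.exp(k2 * v))


def sample_action(y, sigma, rng=random):
    """Draw the actual action around the actor output y (sigma used as std. dev.)."""
    return rng.gauss(y, sigma)


def actor_gradient(td_signal, y_hat, y, sigma, dy_du):
    """Estimate dJ/du as td_signal * (y_hat - y) / sigma * dy/du, element-wise."""
    scale = td_signal * (y_hat - y) / sigma
    return [scale * g for g in dy_du]


if __name__ == "__main__":
    sigma = exploration_width(v=0.0)
    y_hat = sample_action(0.0, sigma)
    print(sigma, y_hat, actor_gradient(1.0, y_hat, 0.0, sigma, [1.0, 0.5]))
```

The gradient estimate reinforces exploratory deviations that are followed by a positive TD signal and suppresses those followed by a negative one, which is why the quality of the critic's TD signal matters for the actor.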

Since in the AHC method, the critic is used to estimate the value function of the actor's policy 
and provide the internal reinforcement using temporal-difference learning algorithms, the 
efficiency of temporal-difference learning or learning prediction will greatly influence the whole 
learning system's performance. Although the policy of the actor is changing, it may change 
relatively slowly, especially when fast convergence of learning prediction in the critic can be 
realized. In the previous sections, RLS-TD(λ) is shown to have better data efficiency than 
conventional linear TD(λ) algorithms, and a very fast convergence speed can be obtained when 
the initializing constant is chosen appropriately. Thus, applying RLS-TD(λ) to the policy 
evaluation in the critic network will improve the learning prediction performance of the critic and 
is expected to enhance the whole system's learning control performance. Based on the above 
idea, a new AHC method called the Fast-AHC algorithm is proposed in this paper. The efficiency 
of the Fast-AHC algorithm is verified empirically and a detailed analysis of the results is given. 
The following is a complete description of the Fast-AHC algorithm. 



Algorithm 3: The Fast-AHC algorithm 

1: Given: a critic neural network and an actor neural network, which are both linear in 
   parameters, and a stop criterion for the algorithm. 

2: Initialize the state of the MDP and the learning parameters, set t = 0. 

3: While the stop criterion is not satisfied, 

   (3.1) According to the current state s_t, compute the output y_t of the actor network and 
         determine the actual action of the actor by the probability distribution given by 
         (42). 

   (3.2) Take the actual action on the MDP, observe the state transition from s_t to 
         s_{t+1}, and set the reward r_t = r(s_t, s_{t+1}). 

   (3.3) Apply the RLS-TD(λ) algorithm described in (27)-(29) to update the weights of 
         the critic network. 

   (3.4) Apply the following equation to update the weights of the actor network, 

         $$a_{t+1} = a_t + \beta_t \frac{\partial J_{\pi}}{\partial a_t} \tag{46}$$

         where a_t denotes the weights of the actor network and β_t is the learning factor 
         of the actor. 

   (3.5) Let t = t+1, return to 3. 

5.2 Learning Control Experiments on The Cart-Pole Balancing Problem 

The balancing control of inverted pendulums is a typical nonlinear control problem and has been 
widely studied not only in control theory but also in artificial intelligence. In artificial 
intelligence research, the learning control of inverted pendulums is considered a standard test 
problem for machine learning methods, especially for RL algorithms. It has been studied in the 
early work of Michie's BOXES system (Michie et al., 1968) and later in Barto and Sutton (1983), 
where the learning controllers only have two output values: +10 N and -10 N. In Berenji et 
al. (1992) and Lin et al. (1994), AHC methods with continuous outputs are applied to the cart-pole 
balancing problem. In this paper, the cart-pole balancing problem with continuous control values 
is used to illustrate the effectiveness of the Fast-AHC method. 

Figure 8 shows a typical cart-pole balancing control system, which consists of a cart moving 
horizontally and a pole with one end fixed at the cart. Let x denote the horizontal distance 
between the center of the cart and the center of the track, where x is negative when the cart is in 
the left part of the track. Variable θ denotes the angle of the pole from its upright position (in 
degrees) and F is the amount of force (N) applied to the cart to move it towards its left or right. So 
the control system has four state variables x, ẋ, θ, θ̇, where ẋ and θ̇ are the derivatives of x and θ, 
respectively. 

In Figure 8, the mass of the cart is M = 1.0 kg, the mass of the pole is m = 0.1 kg, the half-pole 
length is l = 0.5 m, the coefficient of friction of the cart on the track is μ_c = 0.0005 and the 
coefficient of friction of the pole on the cart is μ_p = 0.000002. The boundary constraints on the 
state variables are given as follows. 

$$-12^{\circ} < \theta < 12^{\circ} \tag{47}$$

$$-2.4\,\mathrm{m} < x < 2.4\,\mathrm{m} \tag{48}$$

The dynamics of the control system can be described by the following equations. 

$$\ddot{\theta} = \frac{(m+M)\,g\sin\theta - \cos\theta\,[F + m l \dot{\theta}^2 \sin\theta - \mu_c\,\mathrm{sgn}(\dot{x})] - \dfrac{\mu_p (m+M)\,\dot{\theta}}{m l}}{\frac{4}{3}(M+m)\,l - m l \cos^2\theta}$$

$$\ddot{x} = \frac{F + m l\,(\dot{\theta}^2 \sin\theta - \ddot{\theta}\cos\theta) - \mu_c\,\mathrm{sgn}(\dot{x})}{M+m} \tag{49}$$

where g is the acceleration due to gravity, which is -9.8 m/s^2. The above parameters and 
dynamics equations are the same as those studied in Barto et al. (1983). 




Figure 8: The cart-pole balancing control system 

In the learning control experiments of the pole-balancing problem, the dynamics (49) are 
assumed to be unknown to the learning controller. In addition to the four state variables, the only 
available feedback is a failure signal that notifies the controller when a failure occurs, which 
means that the values of the state variables exceed the boundary constraints prescribed by 
inequalities (47) and (48). It is a typical reinforcement learning problem, where the failure signal 
serves as the reward. Since an external reward may only be available after a long sequence of 
actions, the critic in the AHC learning controller is used to provide the internal reinforcement 
signal to accomplish the learning task. Learning control experiments on the pole-balancing 
problem are conducted using the conventional AHC method, which uses linear TD(λ) algorithms 
in the critic, and the Fast-AHC method proposed in this paper. 

To solve the continuous state space problem in reinforcement learning, a class of linear 
function approximators called the Cerebellar Model Articulation Controller (CMAC) is 
used. As a neural network model based on the neuro-physiological theory of the human 
cerebellum, CMAC was first proposed by Albus (1975) and has been widely used in automatic 
control and function approximation. In CMAC neural networks, the outputs depend linearly on 
the adjustable parameters or weights. For a detailed discussion of the structure of 
CMAC neural networks, one may refer to Albus (1975) and Sutton & Barto (1998). 

In the AHC and Fast-AHC learning controllers, two CMAC neural networks, each with four inputs 
and one output, are used as the function approximators in the critic and the actor, 
respectively. Each CMAC has C tilings and M partitions for every input, so the total physical 
memory for each CMAC network is M^4 C. To reduce the computation and memory requirements, 
a hashing technique described by the following equations is employed in our experiments. (For a 
detailed discussion of the parameters of the CMAC networks, please refer to Appendix C.) 

$$A(s) = \sum_{i=1}^{4} a(i)\,M^{i-1} \tag{50}$$

$$F(s) = A(s) \bmod K \tag{51}$$

In (50) and (51), s represents an input state vector, a(i) (0 ≤ a(i) < M) is the activated tile for 
the i-th element of s, K is the total size of the physical memory and F(s) is the physical 
memory address corresponding to the state s, which is the remainder of A(s) divided by K. 
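
The hashing step can be illustrated with a short sketch. It assumes that the logical address combines the per-input tile indices as a mixed-radix number, A(s) = Σ a(i) M^(i-1), and that a single tiling is addressed at a time; how the C tilings share the physical memory is specified in Appendix C and is not modeled here. The function names are hypothetical.

```python
def logical_address(tiles, m):
    """A(s) = sum_i a(i) * m**(i-1) for tile indices a(1..n), each in [0, m)."""
    if any(not 0 <= a < m for a in tiles):
        raise ValueError("each tile index must satisfy 0 <= a(i) < m")
    return sum(a * m**i for i, a in enumerate(tiles))


def physical_address(tiles, m, k):
    """F(s) = A(s) mod k, the hashed address in a physical memory of size k."""
    return logical_address(tiles, m) % k


if __name__ == "__main__":
    # Four inputs with m = 4 partitions each, hashed into k = 30 memory cells.
    print(logical_address([1, 2, 0, 3], 4), physical_address([1, 2, 0, 3], 4, 30))
```

Because many logical addresses map to the same physical cell, hashing trades some approximation accuracy for a much smaller number of weights, which directly reduces the size of the variance matrix that RLS-TD(λ) has to maintain.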

In order to compare the performance of different learning algorithms, the initial parameters of 
each learning controller are selected as follows: the weights of the critic are all initialized to 0 
and the weights of the actor are initialized to random numbers in the interval [0, 0.1]. The other 
parameters for the AHC and Fast-AHC algorithms are γ = 0.95, k_1 = 0.4 and k_2 = 0.5. 

In all the experiments, a trial is defined as the period from an initial state to a failure state and 
the initial state of each trial is set to a randomly generated state near the unstable equilibrium 
(0,0,0,0) with a maximum distance of 0.05. Equation (49) is employed to simulate the dynamics 
of the system using the Euler method, which has a time step of 0.02s. When a trial lasts for more 
than 120,000 time steps, it is said to be successful and the learning controller is assumed to be 
able to balance the pole. The reinforcement signal for the problem is defined as 

$$r_t = \begin{cases} -1, & \text{if failure occurs} \\ 0, & \text{otherwise} \end{cases} \tag{52}$$
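
A minimal simulation sketch of this setup is given below, using the parameter values and the 0.02 s Euler step quoted above. Two points are assumptions of the sketch rather than statements from the text: angles are handled in radians, and the gravitational constant is taken as a positive magnitude G = 9.8 so that the upright position is unstable under the sign convention used in the code (the text quotes -9.8 m/s^2 under its own convention). The function names are hypothetical.

```python
import math

M_CART, M_POLE, HALF_LEN = 1.0, 0.1, 0.5   # kg, kg, m (values quoted in the text)
MU_C, MU_P = 0.0005, 0.000002              # friction coefficients from the text
G = 9.8      # assumed magnitude; the text states -9.8 under its own sign convention
DT = 0.02    # Euler time step in seconds


def sgn(v):
    return (v > 0) - (v < 0)


def euler_step(state, force):
    """Advance (x, x_dot, theta, theta_dot) by one Euler step; theta in radians."""
    x, x_dot, th, th_dot = state
    total = M_CART + M_POLE
    sin_t, cos_t = math.sin(th), math.cos(th)
    num = (total * G * sin_t
           - cos_t * (force + M_POLE * HALF_LEN * th_dot**2 * sin_t - MU_C * sgn(x_dot))
           - MU_P * total * th_dot / (M_POLE * HALF_LEN))
    den = (4.0 / 3.0) * total * HALF_LEN - M_POLE * HALF_LEN * cos_t**2
    th_acc = num / den
    x_acc = (force + M_POLE * HALF_LEN * (th_dot**2 * sin_t - th_acc * cos_t)
             - MU_C * sgn(x_dot)) / total
    return (x + DT * x_dot, x_dot + DT * x_acc, th + DT * th_dot, th_dot + DT * th_acc)


def failed(state):
    """True when the cart or pole leaves the region allowed by (47) and (48)."""
    x, _, th, _ = state
    return abs(x) >= 2.4 or abs(th) >= math.radians(12.0)


def reward(state):
    """Reinforcement signal of (52): -1 on failure, 0 otherwise."""
    return -1.0 if failed(state) else 0.0


if __name__ == "__main__":
    s = (0.0, 0.0, 0.01, 0.0)
    steps = 0
    while not failed(s):
        s = euler_step(s, 0.0)              # no control force applied
        steps += 1
    print(steps, reward(s))
```

Without control the pole falls within a short number of steps, and the learning controller receives a nonzero external reward only at that moment, which is why the internal reinforcement from the critic is needed.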

The performance of the Fast-AHC method is tested extensively, where different parameter 
settings including λ and the initial variance matrix P_0 are chosen. In the experiments, the 
forgetting factor of RLS-TD(λ) in the critic is set to a value that is equal to 1 or very close to 1. 
The learning control experiments using conventional AHC methods are also conducted for 
comparison. The performance comparisons between the two algorithms are shown in Figures 9, 10 
and 11. 

In the above experiments, the initial variance matrices of the Fast-AHC algorithm are all set 
to P_0 = 0.1I. The performance of Fast-AHC is compared with AHC for different λ. The sizes of 
the physical memories of the critic network and the actor network are chosen as 30 and 80, 
respectively. For each parameter setting of the two algorithms, 5 independent runs are tested. The 
performance is evaluated according to the number of trials needed to successfully balance the pole. 
The learning factors for the actor networks are all set to 0.5, which is a manually optimized value 
for both algorithms. In all the experiments, 11 settings of λ are tested. 

Figure 9: Performance comparison between Fast-AHC and AHC with α = 0.01 

Figure 10: Performance comparison between Fast-AHC and AHC with α = 0.03 

Figure 11: Performance comparison between Fast-AHC and AHC with α = 0.05 



In Figures 9, 10 and 11, the learning factors of the critic networks in AHC are chosen as 
α = 0.01, 0.03 and 0.05, respectively. It is found that when α < 0.01, the performance of AHC 
becomes worse. For learning factors that are greater than 0.05, the AHC algorithm may 
become unstable, and even when α = 0.03 and α = 0.05, the AHC algorithm becomes unstable for 
λ = 1. For the time-varying learning factors specified in (s1)-(s3), the performance is worse than 
with the above constant learning factors. So the above three settings of the learning factor α are 
typical and near optimal for the AHC algorithm. 

From the above experimental results, it can be concluded that by using RLS-TD(λ) in the 
critic network, the Fast-AHC algorithm can obtain better performance than conventional AHC 
algorithms. Although Fast-AHC requires more computation per step than AHC, it is more 
efficient than AHC in that fewer trials or data are needed to successfully balance the pole. 

As has been discussed in the previous sections, the convergence performance of RLS-TD(λ) 
is influenced by the initial value of the variance matrix. This is also the case in Fast-AHC. In the 
above learning control experiments, a small value δ = 0.1 is selected. In other experiments, when δ 
is set to other small values, the performance of Fast-AHC is satisfactory and is better than AHC. 
However, when δ is equal to a relatively large value, for example δ = 100 or 500, the performance 
of Fast-AHC deteriorates significantly. Since RLS-TD(λ) with a large initializing constant has 
similar performance to LS-TD(λ), it can be deduced that the AHC method using LS-TD(λ) in the 
critic will also perform poorly in the cart-pole balancing problem. To verify this, 
experiments are conducted using Fast-AHC with a large initializing constant δ and AHC using 
LS-TD(λ). For each parameter setting, 5 independent runs are tested. In the experiments, the 
maximum number of trials for each algorithm in one run is 200, so that if an algorithm fails to 
balance the pole within 200 trials, its performance is set to 200. When using LS-TD(λ) in the AHC 
method, there may be computational problems in the matrix inversion during the first few steps of 
learning, and two methods are tried to avoid this problem. One is to use TD(λ) in the first 60 steps 
of updates. The other is that the actor is not updated in the early stage of learning until LS-TD(λ) is 
stable. However, similar results are found for the two methods. Figure 12 shows the experimental 
results, which clearly verify that the performance of Fast-AHC with a large initializing constant δ 
is similar to that of AHC using LS-TD(λ) and is much worse than that of Fast-AHC with a small δ. 
A detailed discussion of this phenomenon is provided in subsection 5.4. 

Figure 12: Performance comparison of Fast-AHC with different initial variance 

Figures 13 and 14 plot the variations of the pole angle θ and the control 
force F, where a successfully trained Fast-AHC learning controller is used to control 
the cart-pole system. 

Figure 13: Variation of the pole angle 

Figure 14: Variation of the control force 



5.3 Learning Control Experiments of The Acrobot 

In this subsection, another learning control example, which is the swing-up control of the acrobot 
in minimum time, is presented. The learning control of the acrobot is a class of adaptive optimal 
control problem that is more difficult than the pole-balancing problem. It has been investigated in 
Sutton (1996), where CMAC-based Sarsa-learning algorithms were employed to solve it and only 
the case of discrete control actions was studied. In our experiments, the case of continuous actions 
is considered. 

An acrobot moving in the vertical plane is shown in Figure 15, where OA and AB are the first 
link and the second link, respectively. The control torque is applied at point A. The goal of the 
swing-up control is to swing the tip B of the acrobot above the line CD which is higher than the 
joint O by an amount of the length of one link. 




Figure 15: The acrobot 

The dynamics of the acrobot system is described by the following equations. 

$$\ddot{\theta}_1 = -(d_2 \ddot{\theta}_2 + \phi_1)/d_1 \tag{53}$$

$$\ddot{\theta}_2 = \frac{\tau + d_2 \phi_1 / d_1 - \phi_2}{m_2 l_{c2}^2 + I_2 - d_2^2 / d_1} \tag{54}$$

where 

$$d_1 = m_1 l_{c1}^2 + m_2 (l_1^2 + l_{c2}^2 + 2 l_1 l_{c2} \cos\theta_2) + I_1 + I_2 \tag{55}$$

$$d_2 = m_2 (l_{c2}^2 + l_1 l_{c2} \cos\theta_2) + I_2 \tag{56}$$

$$\phi_1 = -m_2 l_1 l_{c2} \dot{\theta}_2^2 \sin\theta_2 - 2 m_2 l_1 l_{c2} \dot{\theta}_1 \dot{\theta}_2 \sin\theta_2 + (m_1 l_{c1} + m_2 l_1)\, g \cos(\theta_1 - \pi/2) + \phi_2 \tag{57}$$

$$\phi_2 = m_2 l_{c2}\, g \cos(\theta_1 + \theta_2 - \pi/2) \tag{58}$$

In the above equations, the parameters θ_i, θ̇_i, m_i, l_i, I_i, l_ci are the angle, the angular 
velocity, the mass, the length, the moment of inertia and the distance to the center of mass for 
link i (i = 1, 2), respectively. 
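
The acrobot equations can be evaluated numerically as in the sketch below. It uses the parameter values quoted for the simulation in this subsection (unit masses, unit link lengths, unit moments of inertia, centers of mass at 0.5 m, g = 9.8 m/s^2) and assumes the form of the second-joint equation given in Sutton (1996), including its denominator m_2 l_c2^2 + I_2 - d_2^2/d_1. The function name `accelerations` is hypothetical.

```python
import math

M1 = M2 = 1.0      # link masses (kg)
L1 = 1.0           # length of link 1 (m)
LC1 = LC2 = 0.5    # distances to the centers of mass (m)
I1 = I2 = 1.0      # moments of inertia (kg m^2)
G = 9.8            # gravitational acceleration (m/s^2)


def accelerations(th1, th2, th1_dot, th2_dot, tau):
    """Return (th1_ddot, th2_ddot) for joint angles in radians and torque tau."""
    d1 = M1 * LC1**2 + M2 * (L1**2 + LC2**2 + 2 * L1 * LC2 * math.cos(th2)) + I1 + I2
    d2 = M2 * (LC2**2 + L1 * LC2 * math.cos(th2)) + I2
    phi2 = M2 * LC2 * G * math.cos(th1 + th2 - math.pi / 2)
    phi1 = (-M2 * L1 * LC2 * th2_dot**2 * math.sin(th2)
            - 2 * M2 * L1 * LC2 * th1_dot * th2_dot * math.sin(th2)
            + (M1 * LC1 + M2 * L1) * G * math.cos(th1 - math.pi / 2)
            + phi2)
    th2_ddot = (tau + d2 / d1 * phi1 - phi2) / (M2 * LC2**2 + I2 - d2**2 / d1)
    th1_ddot = -(d2 * th2_ddot + phi1) / d1
    return th1_ddot, th2_ddot


if __name__ == "__main__":
    print(accelerations(0.0, 0.0, 0.0, 0.0, 0.0))   # hanging at rest: approximately zero
    print(accelerations(0.0, 0.0, 0.0, 0.0, 1.0))   # a torque at the second joint
```

At the stable equilibrium with zero torque both accelerations vanish up to rounding, and a positive torque at the second joint accelerates the second link while pushing the first link in the opposite direction, which is the coupling the swing-up controller has to exploit.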

Let s_T denote the goal state of the swing-up control. Since the control aim is to swing up the 
acrobot in minimum time, the reward function r_t is defined as 

$$r_t = \begin{cases} 1, & \text{if } s_t = s_T \\ 0, & \text{else} \end{cases} \tag{59}$$

In the simulation experiments, the control torque τ is continuous and is bounded by [-3 N, 3 N]. 
Similar to the cart-pole balancing problem, CMAC neural networks are applied to solve the above 
learning control problem with continuous states and actions. In the CMAC-based actor-critic 
controller, the actor network and the critic network both have C=4 tilings and M=l partitions for 
each input. In the actor network, uniform coding is employed and non-uniform coding is used in 
the critic network. For details of the coding parameters, please refer to Appendix C. The sizes of 
the physical memories for the actor network and the critic network are 100 and 80, respectively. 
In the CMAC networks, the following hashing techniques are used. (For the definition of A(s), a(i) 
and F(s), please refer to Subsection 5.2.) 

$$A(s) = \sum_{i=1}^{4} a(i) \times M^{i-1} \tag{60}$$

$$F(s) = A(s) \bmod K \tag{61}$$

In the simulation, the parameters for the acrobot are chosen as m_1 = m_2 = 1 kg, 
I_1 = I_2 = 1 kg·m^2, l_c1 = l_c2 = 0.5 m, l_1 = l_2 = 1 m and g = 9.8 m/s^2. The time step for 
simulation is 0.05 s and the time interval for learning control is 0.2 s. The learning parameters 
are λ = 0.6, γ = 0.90, β = 0.2, k_1 = 0.4, k_2 = 0.5. A 
trial is defined as the period that starts from the stable equilibrium and ends when the goal state is 
reached. After each trial, the state of the acrobot is re-initialized to its stable equilibrium. For each 
parameter setting, 5 independent runs are tested. Each run consists of 50 trials, and after the 50th 
trial the actor network is tested by controlling the acrobot alone, i.e., by setting the action variance 
defined in (43) to zero. The performance of the algorithms is evaluated according to the number of 
steps used by the actor networks to swing up the acrobot. 

The performance comparisons between Fast-AHC and AHC are shown in Figures 16, 17 and 
18. In the experiments, both algorithms are tested with different λ and AHC is also tested with 
different learning factors α of the critic networks. 

From the results, it is also shown that Fast-AHC can achieve higher data efficiency than AHC. 
However, in this example, a relatively large δ is used, which is different from the previous 
cart-pole balancing example. In other experiments, good performance is obtained with a large 
initializing constant, and when δ is very small, the performance deteriorates significantly. Thus 
this problem may be attributed to the low SNR case in Moustakides (1997), where large values of 
δ are preferable for the best convergence rate of RLS methods. 

Figure 16: Performance comparison between Fast-AHC and AHC with α = 0.02 

Figure 17: Performance comparison between Fast-AHC and AHC with α = 0.05 

Figure 18: Performance comparison between Fast-AHC and AHC with α = 0.1 



Figure 19 shows the performance comparison between Fast-AHC with a large 
(300) and a small (0.01) value of δ, where 6 settings of the parameter λ are tested for each 
algorithm. The performance of AHC using LS-TD(λ) is also shown. In Figure 20, a typical curve 
of the angle of the first link is plotted, where the acrobot is controlled by the actor network of the 
Fast-AHC method (λ = 0.6) after 50 trials. 

Figure 19: Performance comparison of Fast-AHC and AHC using LS-TD(λ) 

Figure 20: Variation of the angle of link 1 (Controlled by Fast-AHC after 50 trials; 
horizontal axis: t (s)) 



5.4 Analysis of The Experimental Results 

Based on the above experimental results, it can be concluded that by using the RLS-TD(λ) 
algorithm in the critic network, the Fast-AHC algorithm can obtain better performance than 
conventional AHC algorithms in that fewer trials or data are needed to converge to a near optimal 
policy. As is well known, one difficulty for the applications of RL methods is their slow 
convergence, especially in cases where learning data are hard to generate. For the 
Fast-AHC algorithm, although more computation per step is required than for conventional AHC 
methods, this will not be a serious problem when the number of linear state features is small. In all 
of our learning control experiments, hashing techniques are used to reduce the state features in 
CMAC networks so that the computation of Fast-AHC can be reduced to an economical amount. 
Nevertheless, when the number of state features is large, conventional AHC methods may be 
preferable. 

In the experiments, it is observed that the performance of Fast-AHC is affected by the 
initializing constant δ. These results are consistent with the properties of RLS-TD(λ) and the RLS 
method in adaptive filtering, which have been discussed in Section 4. In the learning control 
experiments of the cart-pole balancing problem, better performance of Fast-AHC is obtained by 
using small values of δ, while in the learning control of the acrobot, higher data efficiency is 
achieved using Fast-AHC with a relatively large δ. These two different properties of Fast-AHC 
may be attributed to the different SNR cases for RLS methods (Moustakides, 1997). A thorough 
theoretical analysis of this problem is an interesting topic for future research. 

In our experiments, the performance of the AHC method using LS-TD(λ) is also tested. As
studied in Section 4, when the initializing constant δ is large, the performance of
RLS-TD(λ) and LS-TD(λ) does not differ much. So the performance of AHC using LS-TD(λ) is
similar to that of Fast-AHC with large values of δ.
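
The closeness of the two methods for large δ can be read off equation (76) in Appendix B: with W_0 = 0, the RLS-TD(λ) estimate is [I/δ + Σ A(X_t)]^{-1} Σ b(X_t), while LS-TD(λ) omits the I/δ term. The following is a minimal numerical sketch of that observation, not the experimental setup of this paper. It assumes P_0 = δI, W_0 = 0, A(X_t) = z_t(φ^T(x_t) − γφ^T(x_{t+1})), b(X_t) = z_t r_t, and a made-up deterministic three-state cycle with one-hot features; the function name td_statistics is hypothetical.

```python
import numpy as np


def td_statistics(states, rewards, n_states, gamma, lam):
    """Accumulate sum_t A(X_t) and sum_t b(X_t) along one path (one-hot features)."""
    A = np.zeros((n_states, n_states))
    b = np.zeros(n_states)
    z = np.zeros(n_states)
    eye = np.eye(n_states)
    for t, r in enumerate(rewards):
        z = gamma * lam * z + eye[states[t]]              # eligibility trace
        A += np.outer(z, eye[states[t]] - gamma * eye[states[t + 1]])
        b += z * r
    return A, b


# Illustrative data only: cycle 0 -> 1 -> 2 -> 0 with state-dependent rewards.
states = [t % 3 for t in range(31)]
rewards = [[1.0, 0.0, 2.0][s] for s in states[:-1]]
A, b = td_statistics(states, rewards, 3, gamma=0.9, lam=0.5)

w_ls = np.linalg.solve(A, b)                              # LS-TD(lambda) estimate
gaps = {}
for delta in (0.1, 10.0, 1000.0):
    w_rls = np.linalg.solve(np.eye(3) / delta + A, b)     # (76) with W_0 = 0
    gaps[delta] = np.max(np.abs(w_rls - w_ls))
    print(delta, gaps[delta])
```

In this sketch the gap between the two estimates shrinks as δ grows, which matches the behavior described above.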

As studied in Moustakides (1997), the RLS method can converge much faster than other
adaptive filtering methods if the environment is stationary and the initializing constant is selected
appropriately. In some cases, RLS may converge almost instantly. This is also verified in the
learning prediction experiments of the RLS-TD(λ) algorithm. When RLS-TD(λ) is applied in an
actor-critic learning controller, the policy of the actor changes over time, but it can still
be assumed that the policy changes slowly compared with the fast
convergence of RLS-TD(λ). Thus good learning prediction performance can be obtained
in the critic. Moreover, since the learning prediction performance of the critic is important to the
policy learning of the actor, the improvement in learning prediction efficiency contributes to
the overall performance improvement of the controller.

6. Conclusions and Future Work 

Two new reinforcement learning algorithms using RLS methods, called RLS-TD(λ)
and Fast-AHC, are proposed in this paper. RLS-TD(λ) can be used to solve learning
prediction problems more efficiently than conventional linear TD(λ) algorithms. The
convergence with probability 1 is proved for RLS-TD(λ) and the limit of convergence is also
analyzed. Experimental results on learning prediction problems show that the RLS-TD(λ)
algorithm is superior to conventional TD(λ) algorithms in data efficiency, and it also eliminates
the design problem of the step sizes in linear TD(λ) algorithms. RLS-TD(λ) can be viewed as
the extension of RLS-TD(0) from λ=0 to general 0<λ<1. Although the effect of λ on the
convergence speed of RLS-TD(λ) may not be significant in some cases, the use of λ>0
still affects the approximation error bound. Thus, when value function
estimation with high precision is needed, large values of λ are preferable to λ=0. Furthermore,
RLS-TD(λ) is superior to LS-TD(λ) in computation when the weight vector must be updated after
every observation.

Since learning prediction can be viewed as a sub-problem of learning control, we extend the
results in learning prediction to a learning control method called the AHC algorithm. Using
RLS-TD(λ) in the critic network, Fast-AHC achieves better data efficiency than the conventional
AHC method. Simulation results on the learning control of the pole-balancing
problem and the acrobot system confirm the above analyses.

In the experiments, it is found that the performance of RLS-TD(λ) as well as Fast-AHC is
influenced by the initializing constant δ of RLS methods. Different values of δ are needed for best
performance in different cases. This is also a well-known phenomenon in RLS-based adaptive
filtering, and the theoretical results in Moustakides (1997) provide some basis for explaining
our results. A complete investigation of this problem is our ongoing work.

The idea of using RLS-TD(λ) in the critic network may be applied to other reinforcement
learning methods with actor-critic architectures. In Konda and Tsitsiklis (1998), a new actor-critic
algorithm using linear function approximators is proposed and its convergence under certain
conditions is proved. One condition for the convergence of this algorithm is that the critic
converges much faster than the actor. Thus the application of RLS-TD(λ) in the
critic may be preferable in order to ensure the convergence of the algorithm. Theoretical and
empirical work on this problem deserves future study.
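
Appendix A. Derivation of the RLS-TD(λ) Algorithm

For the derivation of RLS-TD(λ), there are two different cases, which are determined by the value
of the forgetting factor.

(1) RLS-TD(λ) with a unit forgetting factor.

Since

$$P_t = A_t^{-1} \tag{62}$$

$$P_0 = \delta I \tag{63}$$

$$K_{t+1} = P_{t+1} z_t \tag{64}$$

according to Lemma 1,

$$P_{t+1} = A_{t+1}^{-1} = P_t - P_t z_t \left[1 + \left(\phi^T(x_t) - \gamma\phi^T(x_{t+1})\right) P_t z_t\right]^{-1} \left(\phi^T(x_t) - \gamma\phi^T(x_{t+1})\right) P_t \tag{65}$$

$$K_{t+1} = P_{t+1} z_t = P_t z_t \Big/ \left(1 + \left(\phi^T(x_t) - \gamma\phi^T(x_{t+1})\right) P_t z_t\right) \tag{66}$$

$$W_{t+1} = A_{t+1}^{-1} b_{t+1} = P_{t+1} \sum_{i=0}^{t} z_i r_i = P_{t+1}\left(P_t^{-1} W_t + z_t r_t\right) \tag{67}$$

Thus

$$\begin{aligned}
W_{t+1} &= P_{t+1}\left[\left(P_{t+1}^{-1} - z_t\left(\phi^T(x_t) - \gamma\phi^T(x_{t+1})\right)\right) W_t + z_t r_t\right] \\
&= W_t + P_{t+1}\left(z_t r_t - z_t\left(\phi^T(x_t) - \gamma\phi^T(x_{t+1})\right) W_t\right) \\
&= W_t + K_{t+1}\left[r_t - \left(\phi^T(x_t) - \gamma\phi^T(x_{t+1})\right) W_t\right]
\end{aligned} \tag{68}$$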
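
The step from (62) to (65) and (66) can be spelled out as follows. This is a sketch that assumes Lemma 1 is the matrix inversion lemma in its rank-one (Sherman-Morrison) form; the shorthand d_t is introduced here for readability and does not appear in the paper.

```latex
% d_t is shorthand introduced for this sketch only.
\begin{align*}
d_t^T &= \phi^T(x_t) - \gamma\,\phi^T(x_{t+1}), \qquad A_{t+1} = A_t + z_t d_t^T \\
\left(A_t + z_t d_t^T\right)^{-1} &= A_t^{-1} - \frac{A_t^{-1} z_t\, d_t^T A_t^{-1}}{1 + d_t^T A_t^{-1} z_t} \\
P_{t+1} z_t &= P_t z_t - \frac{P_t z_t \left(d_t^T P_t z_t\right)}{1 + d_t^T P_t z_t}
             = \frac{P_t z_t}{1 + d_t^T P_t z_t}
\end{align*}
```

The second line with A_t^{-1} = P_t gives (65), and the third line is (66).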
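
(2) RLS-TD(λ) with a forgetting factor μ<1.

The derivation of RLS-TD(λ) with a forgetting factor μ<1 is similar to that of the exponentially weighted
RLS algorithm in Haykin (1996, pp. 566-569). Here we only present the results:

$$K_{t+1} = P_t z_t \Big/ \left(\mu + \left(\phi^T(x_t) - \gamma\phi^T(x_{t+1})\right) P_t z_t\right) \tag{69}$$

$$W_{t+1} = W_t + K_{t+1}\left(r_t - \left(\phi^T(x_t) - \gamma\phi^T(x_{t+1})\right) W_t\right) \tag{70}$$

$$P_{t+1} = \frac{1}{\mu}\left[P_t - P_t z_t \left[\mu + \left(\phi^T(x_t) - \gamma\phi^T(x_{t+1})\right) P_t z_t\right]^{-1} \left(\phi^T(x_t) - \gamma\phi^T(x_{t+1})\right) P_t\right] \tag{71}$$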
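
The update rules (69) to (71) translate almost line by line into code. The sketch below is one possible reading, not the authors' implementation. It assumes W_0 = 0, P_0 = δI as in (63), and a trace started at z_0 = φ(x_0); the function names and the toy three-state cycle are hypothetical. For μ = 1 the recursion is checked against the batch expression [P_0^{-1} + Σ A(X_t)]^{-1} Σ b(X_t) from Appendix B.

```python
import numpy as np


def rls_td_step(W, P, z, phi_t, phi_next, r, gamma, mu=1.0):
    """One RLS-TD(lambda) update following (69)-(71); mu = 1 gives (65)-(68)."""
    d = phi_t - gamma * phi_next               # d^T = phi^T(x_t) - gamma * phi^T(x_{t+1})
    Pz = P @ z
    denom = mu + d @ Pz                        # scalar mu + d^T P_t z_t
    K = Pz / denom                             # gain, equation (69)
    W_new = W + K * (r - d @ W)                # weight update, equation (70)
    P_new = (P - np.outer(Pz, d @ P) / denom) / mu   # equation (71)
    return W_new, P_new


def run_rls_td(features, rewards, gamma, lam, delta, mu=1.0):
    """Run the recursion along one path; features[t] is phi(x_t), one row more than rewards."""
    n = features.shape[1]
    W, P, z = np.zeros(n), delta * np.eye(n), np.zeros(n)
    for t, r in enumerate(rewards):
        z = gamma * lam * z + features[t]      # eligibility trace
        W, P = rls_td_step(W, P, z, features[t], features[t + 1], r, gamma, mu)
    return W, P


def batch_solution(features, rewards, gamma, lam, delta):
    """Batch form with W_0 = 0 and unit forgetting factor, used only as a check."""
    n = features.shape[1]
    A, b, z = np.eye(n) / delta, np.zeros(n), np.zeros(n)
    for t, r in enumerate(rewards):
        z = gamma * lam * z + features[t]
        A += np.outer(z, features[t] - gamma * features[t + 1])
        b += z * r
    return np.linalg.solve(A, b)


# Illustrative data only: cycle 0 -> 1 -> 2 -> 0 with one-hot features.
states = [t % 3 for t in range(61)]
features = np.eye(3)[states]
rewards = np.array([[1.0, 0.0, 2.0][s] for s in states[:-1]])
W, _ = run_rls_td(features, rewards, gamma=0.9, lam=0.5, delta=10.0)
print(np.allclose(W, batch_solution(features, rewards, 0.9, 0.5, 10.0)))
```

Each step costs a few matrix-vector products in the number of features, with no matrix inversion, which is the computational advantage over re-solving the batch problem after every observation.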

Appendix B. Proof of Theorem 1

To study the steady-state property of the Markov chain defined in Section 3, we construct a stationary
process as follows. Let {x_t} be a Markov chain that evolves according to the transition matrix P and is
already in its steady state, which means that Pr{x_t = i} = π(i) for all i and t. Given any sample path of
the Markov chain, we define

$$z_t = \sum_{\tau=-\infty}^{t} (\gamma\lambda)^{t-\tau} \phi(x_\tau) \tag{72}$$

Then X_t = {x_t, x_{t+1}, z_t} is a stationary process, which is the same as that discussed in (Tsitsiklis
and Roy, 1997).
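
Definition (72) implies the one-step recursion z_t = γλ z_{t−1} + φ(x_t), which is the form an online algorithm would store. The sketch below checks this on a finite path. It assumes the sum is truncated at τ = 0, since the infinite past in (72) is not available in a simulation, and it uses arbitrary random features purely for illustration; the function names are hypothetical.

```python
import numpy as np


def trace_by_sum(features, t, gamma, lam):
    """Direct evaluation of (72), truncated at tau = 0."""
    return sum((gamma * lam) ** (t - tau) * features[tau] for tau in range(t + 1))


def trace_by_recursion(features, t, gamma, lam):
    """Recursive evaluation: z_t = gamma * lam * z_{t-1} + phi(x_t), starting from zero."""
    z = np.zeros(features.shape[1])
    for tau in range(t + 1):
        z = gamma * lam * z + features[tau]
    return z


rng = np.random.default_rng(0)
features = rng.normal(size=(20, 4))    # illustrative phi(x_0), ..., phi(x_19)
z_sum = trace_by_sum(features, 19, gamma=0.9, lam=0.6)
z_rec = trace_by_recursion(features, 19, gamma=0.9, lam=0.6)
print(np.allclose(z_sum, z_rec))
```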

Let D denote an N×N diagonal matrix with diagonal entries π(1), π(2), ..., π(N), where N is
the cardinality of the state space X. Then Lemma 2 can be derived as follows.

Lemma 2. (Tsitsiklis and Roy, 1997) Under Assumptions 1-4, the following equations hold.

1) $$E_0\left[\phi(x_t)\phi^T(x_{t+m})\right] = \Phi^T D P^m \Phi, \quad \text{for } m > 0 \tag{73}$$

2) $$E_0\left[z_t \phi^T(x_t)\right] = \sum_{m=0}^{\infty} (\gamma\lambda)^m \Phi^T D P^m \Phi \tag{74}$$

3) $$E_0\left[z_t r_t(x_t, x_{t+1})\right] = \sum_{m=0}^{\infty} (\gamma\lambda)^m \Phi^T D P^m r \tag{75}$$

where $r \in R^N$, whose i-th component is equal to $E[r(x_t, x_{t+1}) \mid x_t = i]$.

According to Lemma 2, E_0[A(X_t)] and E_0[b(X_t)] are well defined and finite. Furthermore, E_0[A(X_t)]
is negative definite, so it is invertible.
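From equation (67),

$$W_{\text{RLS-TD}(\lambda)} = \left[P_0^{-1} + \sum_{t=1}^{T} A(X_t)\right]^{-1} \left[P_0^{-1} W_0 + \sum_{t=1}^{T} b(X_t)\right] \tag{76}$$

Since

$$E_0[A(X_t)] = \lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} A(X_t) \tag{77}$$

$$E_0[b(X_t)] = \lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} b(X_t) \tag{78}$$

and E_0[A(X_t)] is invertible, we can divide both bracketed terms in (76) by T, which leaves their
product unchanged and makes the terms P_0^{-1}/T and P_0^{-1}W_0/T vanish as T → ∞. This gives

$$\lim_{T\to\infty} W_{\text{RLS-TD}(\lambda)} = E_0^{-1}[A(X_t)]\, E_0[b(X_t)] = W^* \tag{79}$$

Thus W_{RLS-TD(λ)} converges to W* with probability 1.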
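
The limit in (79) can be evaluated in closed form for a small chain. The sketch below is an illustration under stated assumptions, not part of the proof. It takes A(X_t) = z_t(φ^T(x_t) − γφ^T(x_{t+1})) and b(X_t) = z_t r_t, as implied by (65) to (68); it sums the series of Lemma 2 with the geometric identity Σ_m (γλ)^m P^m = (I − γλP)^{-1}, using the shifted form of (74) for φ^T(x_{t+1}) from the cited reference; and it uses a made-up three-state ergodic chain. The function names are hypothetical. With one-hot features, the limit should coincide with the exact value function (I − γP)^{-1} r.

```python
import numpy as np


def stationary_distribution(P):
    """Left eigenvector of P for the eigenvalue closest to 1, normalized to sum to 1."""
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return pi / pi.sum()


def rls_td_limit(P, Phi, r_bar, gamma, lam):
    """W* = E0[A]^{-1} E0[b], with the series of Lemma 2 summed in closed form."""
    n = P.shape[0]
    D = np.diag(stationary_distribution(P))
    M = np.linalg.inv(np.eye(n) - gamma * lam * P)    # sum_m (gamma*lam)^m P^m
    EA = Phi.T @ D @ M @ (np.eye(n) - gamma * P) @ Phi
    Eb = Phi.T @ D @ M @ r_bar
    return np.linalg.solve(EA, Eb)


# Hypothetical ergodic chain and expected one-step rewards, for illustration only.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.4, 0.0, 0.6]])
r_bar = np.array([1.0, 0.0, 2.0])     # the vector r of Lemma 2
gamma, lam = 0.9, 0.7

W_star = rls_td_limit(P, np.eye(3), r_bar, gamma, lam)    # one-hot features
V_true = np.linalg.solve(np.eye(3) - gamma * P, r_bar)    # exact value function
print(np.allclose(W_star, V_true))
```

With fewer features than states, the same function returns the fixed point to which the linear approximator converges, which generally differs from the exact value function.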



Appendix C. Some details of the coding structures of CMAC networks 

In the following discussion, the coding structures of CMAC networks in the cart-pole balancing 
problem and the acrobot control problem are presented. 

(1) CMAC coding structures in the cart-pole balancing problem 

In the CMAC networks, the state variables have the following boundaries. 

$\theta \in [-12^\circ, 12^\circ]$, $\dot{\theta} \in [-50\ \mathrm{deg/s}, 50\ \mathrm{deg/s}]$

$x \in [-2.4, 2.4]$, $\dot{x} \in [-1, 1]$

For the critic network, C=4 and M=l . The hashing technique specified in equations (50) and (51) 
is employed and the total memory size is 30. 

For the actor network, C=4 and M=l. The hashing technique specified in equations (60) and (61) 
is employed and the total memory size is 100. 

(2) CMAC coding structures in the acrobot swing-up problem 

In the simulation, the angles are bounded by $[-\pi, \pi]$ and the angular velocities are bounded by
$\dot{\theta}_1 \in [-4\pi, 4\pi]$, $\dot{\theta}_2 \in [-9\pi, 9\pi]$. The numbers of tilings of the actor and the critic are both equal to 4
(C=4). The total memory sizes for the critic and the actor are 80 and 100, respectively. In the actor
network, each tiling partitions the range of each input into 7 equal intervals (M=7). In the critic
network, the partitions for each input are non-uniform and are given by

$\theta_1: \{-\pi, -1, -0.5, 0, 0.5, 1, \pi\}$, $\dot{\theta}_1: \{-4\pi, -1.5\pi, -0.5\pi, 0, 0.5\pi, 1.5\pi, 4\pi\}$
$\theta_2: \{-\pi, -1, -0.5, 0, 0.5, 1, \pi\}$, $\dot{\theta}_2: \{-9\pi, -2\pi, -0.5\pi, 0, 0.5\pi, 2\pi, 9\pi\}$
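
As an illustration of the non-uniform critic partitions listed above, the sketch below maps an acrobot state to one interval index per input for a single tiling. It is only a sketch: the offsets between the C=4 tilings and the hashing of equations (50) and (51) are not reproduced here, and the function and variable names are hypothetical.

```python
import bisect
import math

PI = math.pi
# Breakpoints of the critic partitions, copied from the lists above.
BREAKPOINTS = {
    "theta1": [-PI, -1.0, -0.5, 0.0, 0.5, 1.0, PI],
    "theta1_dot": [-4 * PI, -1.5 * PI, -0.5 * PI, 0.0, 0.5 * PI, 1.5 * PI, 4 * PI],
    "theta2": [-PI, -1.0, -0.5, 0.0, 0.5, 1.0, PI],
    "theta2_dot": [-9 * PI, -2 * PI, -0.5 * PI, 0.0, 0.5 * PI, 2 * PI, 9 * PI],
}


def interval_index(value, breakpoints):
    """Index k of the interval [b_k, b_{k+1}) containing value, clipped to the valid range."""
    k = bisect.bisect_right(breakpoints, value) - 1
    return min(max(k, 0), len(breakpoints) - 2)


def tile_coordinates(state):
    """Interval index of each state variable for one tiling without offset."""
    return tuple(interval_index(state[name], BREAKPOINTS[name]) for name in BREAKPOINTS)


example = {"theta1": 0.2, "theta1_dot": -2.0, "theta2": -3.0, "theta2_dot": 7.0}
print(tile_coordinates(example))   # -> (3, 1, 0, 5)
```

A binary search over the sorted breakpoints keeps the lookup cheap even if finer partitions are used near the origin.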